Abstract


This paper considers convex optimization problems where nodes of a network have access to 
summands of a global objective. Each of these local objectives is further assumed to be an average 
of a finite set of functions. The motivation for this setup is to solve large scale machine learning 
problems where elements of the training set are distributed to multiple computational elements. 

The decentralized double stochastic averaging gradient (DSA) algorithm is proposed as a solution 
alternative that relies on: (i) The use of local stochastic averaging gradients, (ii) Determination of 
descent steps as differences of consecutive stochastic averaging gradients. Strong convexity of local 
functions and Lipschitz continuity of local gradients are shown to guarantee linear convergence of 
the sequence generated by DSA in expectation. Local iterates are further shown to approach the 
optimal argument for almost all realizations. The expected linear convergence of DSA is in contrast 
to the sublinear rate characteristic of existing methods for decentralized stochastic optimization. 
Numerical experiments on a logistic regression problem illustrate reductions in convergence time 
and number of feature vectors processed until convergence relative to these other alternatives. 
Keywords: Decentralized optimization, stochastic optimization, stochastic averaging gradient, 
logistic regression. 

1. Introduction 

We consider machine learning problems with large training sets that are distributed into a network 
of computing agents so that each of the nodes maintains a moderate number of samples. This leads 
to decentralized consensus optimization problems where summands of the global objective function 
are available at different nodes of the network. In this class of problems agents (nodes) try to 
optimize the global cost function by operating on their local functions and communicating with 
their neighbors only. Specifically, consider a variable $\mathbf{x} \in \mathbb{R}^p$ and a connected network of size $N$ 
where each node $n$ has access to a local objective function $f_n : \mathbb{R}^p \to \mathbb{R}$. The local objective function 
$f_n(\mathbf{x})$ is defined as the average of $q_n$ local instantaneous functions $f_{n,i}(\mathbf{x})$ that can be individually 
evaluated at node $n$. Agents cooperate to solve the global optimization 

$$\mathbf{x}^* := \operatorname*{argmin}_{\mathbf{x}} \sum_{n=1}^{N} f_n(\mathbf{x}) = \operatorname*{argmin}_{\mathbf{x}} \sum_{n=1}^{N} \frac{1}{q_n} \sum_{i=1}^{q_n} f_{n,i}(\mathbf{x}). \qquad (1)$$

The formulation in (1) models a training set with a total of $\sum_{n=1}^{N} q_n$ training samples that are 
distributed among the $N$ agents for parallel processing conducive to the determination of the optimal 
classifier $\mathbf{x}^*$ (Bekkerman et al. (2011); Tsianos et al. (2012a); Cevher et al. (2014)). Although we 
make no formal assumption, in cases of practical importance the total number of training samples 
$\sum_{n=1}^{N} q_n$ is very large, but the number of elements $q_n$ available at a specific node is moderate. 

Our interest here is in solving (1) with a method that is decentralized (nodes operate on their 
local functions and communicate with neighbors only), stochastic (nodes determine a descent 
direction by evaluating only one out of the $q_n$ functions $f_{n,i}$ at each iteration), and has a linear 
convergence rate in expectation (the expected distance to the optimum is scaled by a subunit factor 
at each iteration). 

Decentralized optimization is relatively mature and various methods are known with complementary 
advantages. These methods include decentralized gradient descent (DGD) (Nedic and Ozdaglar (2009); 
Jakovetic et al. (2014); Yuan et al. (2013)), network Newton (Mokhtari et al. (2015a,b)), 
decentralized dual averaging (Duchi et al. (2012); Tsianos et al. (2012b)), the exact first order 
algorithm (EXTRA) (Shi et al. (2015)), as well as the alternating direction method of multipliers 
(ADMM) (Boyd et al. (2011); Shi et al. (2014); Iutzeler et al. (2013)) and its linearized variants 
(Ling and Ribeiro (2014); Ling et al. (2014); Mokhtari et al. (2015c)). The ADMM, its variants, 
and EXTRA converge linearly to the optimal argument but DGD, network Newton, and decentralized 
dual averaging have sublinear convergence rates. Of particular importance to this paper is the 
fact that DGD has (inexact) linear convergence to a neighborhood of the optimal argument when it 
uses constant stepsizes. It can achieve exact convergence by using diminishing stepsizes, but the 
convergence rate degrades to sublinear. This lack of linear convergence is solved by EXTRA through 
the use of iterations that rely on information of two consecutive steps (Shi et al. (2015)). 

All of the algorithms mentioned above require the computationally costly evaluation of the local 
gradients $\nabla f_n(\mathbf{x}) = (1/q_n) \sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{x})$. This cost can be avoided by stochastic decentralized 
algorithms that reduce computational cost of iterations by substituting all local gradients with their 
stochastic approximations. This reduces the computational cost per iteration but results in sublinear 
convergence rates of order $O(1/t)$ even if the corresponding deterministic algorithm exhibits linear 
convergence. This is a drawback that also exists in centralized stochastic optimization where linear 
convergence rates in expectation are established by decreasing the variance of the stochastic 
gradient approximation (Roux et al. (2012); Schmidt et al. (2013); Shalev-Shwartz and Zhang (2013); 
Johnson and Zhang (2013); Konecny and Richtarik (2013); Defazio et al. (2014)). In this paper we 
build on the ideas of the stochastic averaging gradient (SAG) algorithm (Schmidt et al. (2013)) and 
its unbiased version SAGA (Defazio et al. (2014)). Both of these algorithms use the idea of 
stochastic incremental averaging gradients. At each iteration only one of the stochastic gradients is updated 
and the average of all of the most recent stochastic gradients is used to estimate the gradient. 

The contribution of this paper is to develop the decentralized double stochastic averaging gradient 
(DSA) method, a novel decentralized stochastic algorithm for solving (1). The method exploits a 
new interpretation of EXTRA as a saddle point method and uses stochastic averaging gradients in 
lieu of gradients. DSA is decentralized because it is implementable in a network setting where nodes 
can communicate only with their neighbors. It is double because iterations utilize the information of 
two consecutive iterates. It is stochastic because the gradient of only one randomly selected function 
is evaluated at each iteration and it is an averaging method because it uses an average of stochastic 
gradients to approximate the local gradients. DSA is proven to converge linearly to the optimal 
argument x* in expectation. This is in contrast to all other decentralized stochastic methods to 
solve (1), which converge at sublinear rates. 

We begin the paper with a discussion of DGD, EXTRA and stochastic averaging gradient. With 
these definitions in place we define the DSA algorithm by replacing the gradients used in EXTRA 
by stochastic averaging gradients (Section 2). We follow with a digression on the limit points of 
DGD and EXTRA iterations to explain the reason why DGD does not achieve exact convergence 
but EXTRA is expected to do so (Section 2.1). A reinterpretation of EXTRA as a saddle point 
method that solves for the critical points of the augmented Lagrangian of a constrained optimization 
problem equivalent to (1) is then introduced. It follows from this reinterpretation that DSA is a 
stochastic saddle point method (Section 2.2). The fact that DSA is a stochastic saddle point method 
is the critical enabler of the subsequent convergence analysis (Section 3). In particular, it is possible 
to guarantee that strong convexity and gradient Lipschitz continuity of the local instantaneous 
functions $f_{n,i}$ imply that a Lyapunov function associated with the sequence of iterates generated 
by DSA converges linearly to its optimal value in expectation (Theorem 6). Linear convergence in 
expectation of the local iterates to the optimal argument x* of (1) follows as a trivial consequence 
(Corollary 7). We complement this result by showing convergence of all the local variables to the 
optimal argument x* with probability 1 (Theorem 8). 

The advantages of DSA relative to a group of stochastic and deterministic alternatives in solving 
a logistic regression problem with a synthetic dataset are then studied in numerical experiments 
(Section 4). These results demonstrate that DSA is the only decentralized stochastic algorithm that 
reaches the optimal solution with a linear convergence rate. We further show that DSA outperforms 
deterministic algorithms when the metric is the number of times that elements of the training set 
are evaluated. The behavior of DSA for different network topologies is also evaluated. We close the 
paper with pertinent remarks (Section 5). 

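Notation. Lowercase boldface $\mathbf{v}$ denotes a vector and uppercase boldface $\mathbf{A}$ a matrix. For column 
vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ we use the notation $\mathbf{x} = [\mathbf{x}_1; \ldots; \mathbf{x}_N]$ to represent the stack column vector $\mathbf{x}$. We 
use $\|\mathbf{v}\|$ to denote the Euclidean norm of vector $\mathbf{v}$ and $\|\mathbf{A}\|$ to denote the Euclidean norm of matrix $\mathbf{A}$. 
For a vector $\mathbf{v}$ and a positive definite matrix $\mathbf{A}$, the $\mathbf{A}$-weighted norm is defined as $\|\mathbf{v}\|_{\mathbf{A}} := \sqrt{\mathbf{v}^T \mathbf{A} \mathbf{v}}$. 
The null space of matrix $\mathbf{A}$ is denoted by $\mathrm{null}(\mathbf{A})$ and the span of a vector by $\mathrm{span}(\mathbf{x})$. The operator 
$\mathbb{E}_{\mathbf{x}}[\cdot]$ stands for expectation over random variable $\mathbf{x}$ and $\mathbb{E}[\cdot]$ for expectation with respect to the 
distribution of a stochastic process. 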


2. Decentralized Double stochastic averaging gradient 


Consider a connected network that contains $N$ nodes such that each node $n$ can only communicate 
with peers in its neighborhood $\mathcal{N}_n$. Define $\mathbf{x}_n \in \mathbb{R}^p$ as a local copy of the variable $\mathbf{x}$ that is kept 
at node $n$. In decentralized optimization, agents try to minimize their local functions $f_n(\mathbf{x}_n)$ while 
ensuring that their local variables $\mathbf{x}_n$ coincide with the variables $\mathbf{x}_m$ of all neighbors $m \in \mathcal{N}_n$, which, 
given that the network is connected, ensures that the variables $\mathbf{x}_n$ of all nodes are the same and 
renders the problem equivalent to (1). DGD is a well known method for decentralized optimization 
that relies on the introduction of nonnegative weights $w_{nm} \geq 0$ that are not null if and only if $m = n$ 
or if $m \in \mathcal{N}_n$. Letting $t \in \mathbb{N}$ be a discrete time index and $\alpha$ a given stepsize, DGD is defined by the 
recursion 

$$\mathbf{x}_n^{t+1} = \sum_{m=1}^{N} w_{nm} \mathbf{x}_m^t - \alpha \nabla f_n(\mathbf{x}_n^t), \qquad n = 1, \ldots, N. \qquad (2)$$

Since $w_{nm} = 0$ when $m \neq n$ and $m \notin \mathcal{N}_n$, it follows from (2) that node $n$ updates $\mathbf{x}_n^t$ by performing 
an average over the variables $\mathbf{x}_m^t$ of its neighbors $m \in \mathcal{N}_n$ and its own $\mathbf{x}_n^t$, followed by descent 
through the negative local gradient $-\nabla f_n(\mathbf{x}_n^t)$. If a constant stepsize is used, DGD iterates $\mathbf{x}_n^t$ 
approach a neighborhood of the optimal argument x* of (1) but don't converge exactly. To achieve 
exact convergence diminishing stepsizes are used but the resulting convergence rate is sublinear 
(Nedic and Ozdaglar (2009)). 
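The inexact convergence of DGD with a constant stepsize can be seen in a small numerical sketch. The example below is illustrative only: the three-node fully connected network, the scalar quadratic local functions $f_n(x) = (x - b_n)^2/2$, and the names `dgd`, `b`, and `grads` are assumptions made for this sketch and are not part of the paper's experiments.

```python
import numpy as np

def dgd(W, grads, x0, alpha, num_iters):
    """Run the DGD recursion (2) for all nodes; row n of X is the local iterate x_n."""
    X = np.array(x0, dtype=float)
    for _ in range(num_iters):
        # Stack the local gradients, each evaluated at the node's own iterate
        G = np.stack([grads[n](X[n]) for n in range(len(grads))])
        # Weighted average over neighbors followed by a local gradient step
        X = W @ X - alpha * G
    return X

# Hypothetical problem: f_n(x) = 0.5 * (x - b_n)^2, global optimum at mean(b) = 2
b = np.array([0.0, 1.0, 5.0])
grads = [lambda x, bn=bn: x - bn for bn in b]
W = np.full((3, 3), 1.0 / 3.0)           # fully connected network, uniform weights
X = dgd(W, grads, np.zeros((3, 1)), alpha=0.1, num_iters=500)
print(X.ravel())                          # local iterates differ from each other and from 2
```

In this toy problem the network average of the iterates reaches the optimum, but the individual nodes settle at different points around it, which is the neighborhood behavior described above.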

EXTRA is a method that resolves both of these issues by mixing two consecutive DGD iterations 
with different weight matrices and opposite signs. To be precise, introduce a second set of weights 
$\tilde{w}_{nm}$ with the same properties as the weights $w_{nm}$ and define EXTRA through the recursion 

$$\mathbf{x}_n^{t+1} = \mathbf{x}_n^t + \sum_{m=1}^{N} w_{nm} \mathbf{x}_m^t - \sum_{m=1}^{N} \tilde{w}_{nm} \mathbf{x}_m^{t-1} - \alpha \left[ \nabla f_n(\mathbf{x}_n^t) - \nabla f_n(\mathbf{x}_n^{t-1}) \right], \qquad n = 1, \ldots, N. \qquad (3)$$

Figure 1: Stochastic averaging gradient table at node $n$. At each iteration $t$ a random local 
instantaneous gradient $\nabla f_{n,i_n^t}(\mathbf{y}_{n,i_n^t}^t)$ is updated by $\nabla f_{n,i_n^t}(\mathbf{x}_n^t)$. The rest of the local instantaneous 
gradients remain unchanged, i.e., $\nabla f_{n,i}(\mathbf{y}_{n,i}^{t+1}) = \nabla f_{n,i}(\mathbf{y}_{n,i}^t)$ for $i \neq i_n^t$. This list is used 
to compute the stochastic averaging gradient in (7). 

Observe that (3) is well defined for $t > 0$. For $t = 0$ we utilize the regular DGD iteration in (2). In 
the nomenclature of this paper we say that EXTRA performs a decentralized double gradient descent 
step because it operates in a decentralized manner while utilizing a difference of two gradients as 
descent direction. Minor modification as it is, the use of this gradient difference in lieu of simple 
gradients endows EXTRA with exact linear convergence to the optimal argument x* under mild 
assumptions (Shi et al. (2015)). 
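A minimal sketch of the EXTRA recursion (3) follows, with the DGD step (2) used at $t = 0$. The test problem is an assumption chosen for illustration: three fully connected nodes with scalar quadratics $f_n(x) = (x - b_n)^2/2$, and $\tilde{\mathbf{W}} = (\mathbf{I} + \mathbf{W})/2$. The function name `extra` and the stacked-gradient callable `grad` are hypothetical.

```python
import numpy as np

def extra(W, W_tilde, grad, x0, alpha, num_iters):
    """EXTRA recursion (3); grad maps stacked iterates (N x p) to stacked local gradients."""
    X_prev = np.array(x0, dtype=float)
    G_prev = grad(X_prev)
    X = W @ X_prev - alpha * G_prev               # t = 0: regular DGD step (2)
    for _ in range(num_iters):
        G = grad(X)
        # Difference of two consecutive DGD-like steps with weights W and W_tilde
        X_next = X + W @ X - W_tilde @ X_prev - alpha * (G - G_prev)
        X_prev, G_prev, X = X, G, X_next
    return X

# Hypothetical problem: f_n(x) = 0.5 * (x - b_n)^2, global optimum at mean(b) = 2
b = np.array([[0.0], [1.0], [5.0]])
grad = lambda X: X - b
W = np.full((3, 3), 1.0 / 3.0)
W_tilde = (np.eye(3) + W) / 2
X = extra(W, W_tilde, grad, np.zeros((3, 1)), alpha=0.1, num_iters=500)
print(X.ravel())                                   # all nodes agree on the optimum
```

With the same constant stepsize that leaves DGD at a neighborhood of the solution, every node in this toy run reaches the exact optimum.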

If we recall the definitions of the local functions $f_n(\mathbf{x}_n)$ and the instantaneous local functions 
$f_{n,i}(\mathbf{x}_n)$ available at node $n$, the implementation of EXTRA requires that each node $n$ computes 
the full gradient of its local objective function $f_n$ at $\mathbf{x}_n^t$ as 

$$\nabla f_n(\mathbf{x}_n^t) = \frac{1}{q_n} \sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{x}_n^t). \qquad (4)$$


This is computationally expensive when the number of instantaneous functions $q_n$ is large. To resolve 
this issue, local stochastic gradients can be substituted for the local objective function gradients in 
(3). These stochastic gradients approximate the gradient $\nabla f_n(\mathbf{x}_n)$ of node $n$ by randomly choosing 
one of the instantaneous function gradients $\nabla f_{n,i}(\mathbf{x}_n)$. If we let $i_n^t \in \{1, \ldots, q_n\}$ denote a function 
index that we choose at time $t$ at node $n$ uniformly at random and independently of the history of 
the process, then the stochastic gradient is defined as 

$$\hat{\mathbf{s}}_n(\mathbf{x}_n^t) := \nabla f_{n,i_n^t}(\mathbf{x}_n^t). \qquad (5)$$

We can then write a stochastic version of EXTRA by replacing $\nabla f_n(\mathbf{x}_n^t)$ by $\hat{\mathbf{s}}_n(\mathbf{x}_n^t)$ and $\nabla f_n(\mathbf{x}_n^{t-1})$ 
by $\hat{\mathbf{s}}_n(\mathbf{x}_n^{t-1})$. Such an algorithm would have a small computational cost per iteration and, presumably, 
converge to the optimal argument x*. Here, however, we want to design an algorithm with linear 
convergence rate, and stochastic descent algorithms achieve sublinear rates because of the difference 
between the stochastic and deterministic descent directions. 


To reduce this noise we propose the use of stochastic averaging gradients instead (Defazio et al. 
(2014)). The idea is to maintain a list of gradients of all instantaneous functions in which one 
randomly chosen element is replaced at each iteration and to use an average of the elements of this 
list for gradient approximation; see Figure 1. Formally, define the variable $\mathbf{y}_{n,i} \in \mathbb{R}^p$ to represent 
the iterate value the last time that the instantaneous gradient of function $f_{n,i}$ was evaluated. If 
we let $i_n^t \in \{1, \ldots, q_n\}$ denote the function index chosen at time $t$ at node $n$, as we did in (5), the 
variables $\mathbf{y}_{n,i}$ are updated recursively as 

$$\mathbf{y}_{n,i}^{t+1} = \mathbf{x}_n^t \quad \text{if } i = i_n^t, \qquad \mathbf{y}_{n,i}^{t+1} = \mathbf{y}_{n,i}^t \quad \text{if } i \neq i_n^t. \qquad (6)$$

Algorithm 1 DSA algorithm at node n 

Require: Vectors $\mathbf{x}_n^0$. Gradient table initialized with instantaneous gradients $\nabla f_{n,i}(\mathbf{y}_{n,i}^0)$ with $\mathbf{y}_{n,i}^0 = \mathbf{x}_n^0$. 
1: for t = 0, 1, 2, ... do 
2:   Exchange variable $\mathbf{x}_n^t$ with neighboring nodes $m \in \mathcal{N}_n$. 
3:   Choose $i_n^t$ uniformly at random from the set $\{1, \ldots, q_n\}$. 
4:   Compute and store stochastic averaging gradient as per (7): 
       $\hat{\mathbf{g}}_n^t = \nabla f_{n,i_n^t}(\mathbf{x}_n^t) - \nabla f_{n,i_n^t}(\mathbf{y}_{n,i_n^t}^t) + \frac{1}{q_n} \sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{y}_{n,i}^t)$. 
5:   Store $\nabla f_{n,i_n^t}(\mathbf{x}_n^t)$ in the $i_n^t$ gradient table position. 
6:   if t = 0 then 
7:     Update variable $\mathbf{x}_n^t$ as per (9): $\mathbf{x}_n^{t+1} = \sum_{m=1}^{N} w_{nm} \mathbf{x}_m^t - \alpha \hat{\mathbf{g}}_n^t$. 
8:   else 
9:     Update variable $\mathbf{x}_n^t$ as per (8): $\mathbf{x}_n^{t+1} = \mathbf{x}_n^t + \sum_{m=1}^{N} w_{nm} \mathbf{x}_m^t - \sum_{m=1}^{N} \tilde{w}_{nm} \mathbf{x}_m^{t-1} - \alpha \left[ \hat{\mathbf{g}}_n^t - \hat{\mathbf{g}}_n^{t-1} \right]$. 
10:   end if 
11: end for 

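With these definitions in hand we can define the stochastic averaging gradient at node $n$ as 

$$\hat{\mathbf{g}}_n^t := \nabla f_{n,i_n^t}(\mathbf{x}_n^t) - \nabla f_{n,i_n^t}(\mathbf{y}_{n,i_n^t}^t) + \frac{1}{q_n} \sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{y}_{n,i}^t). \qquad (7)$$

Observe that to implement (7) the gradients $\nabla f_{n,i}(\mathbf{y}_{n,i}^t)$ are stored in the local gradient table shown 
in Figure 1. 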
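The gradient table of a single node can be sketched as a small class. This is a minimal illustration rather than the authors' implementation: the class name `AveragingGradientTable`, its attributes, and the scalar quadratic test functions are assumptions. The sketch keeps a running sum of the stored gradients so that each call evaluates only one instantaneous gradient.

```python
import numpy as np

class AveragingGradientTable:
    """Gradient table of one node for the stochastic averaging gradient (7)."""

    def __init__(self, grad_fns, x0):
        # grad_fns[i] is a callable returning the gradient of f_{n,i}
        self.grad_fns = grad_fns
        # Table initialized at y_{n,i}^0 = x_n^0 (one full pass over the local data)
        self.table = np.stack([g(x0) for g in grad_fns])
        self.total = self.table.sum(axis=0)

    def step(self, x, i):
        """Return the averaging gradient for sampled index i, then refresh entry i."""
        new = self.grad_fns[i](x)
        old = self.table[i].copy()
        g = new - old + self.total / len(self.grad_fns)
        self.total += new - old      # keep the running sum consistent with the table
        self.table[i] = new
        return g

# Hypothetical local functions f_i(x) = 0.5 * c_i * (x - a_i)^2
c = [1.0, 2.0, 3.0, 4.0]
a = [1.0, 2.0, 3.0, 6.0]
grad_fns = [lambda x, ci=ci, ai=ai: ci * (x - ai) for ci, ai in zip(c, a)]
tab = AveragingGradientTable(grad_fns, np.zeros(1))
g = tab.step(np.array([1.0]), i=1)
```

In a run of the algorithm the index `i` would be drawn uniformly at random; it is fixed here so that the returned value can be checked by hand.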

The DSA algorithm is a variation of EXTRA that replaces the local gradients $\nabla f_n(\mathbf{x}_n^t)$ in (3) 
with the local stochastic averaging gradients $\hat{\mathbf{g}}_n^t$ in (7), 

$$\mathbf{x}_n^{t+1} = \mathbf{x}_n^t + \sum_{m=1}^{N} w_{nm} \mathbf{x}_m^t - \sum_{m=1}^{N} \tilde{w}_{nm} \mathbf{x}_m^{t-1} - \alpha \left[ \hat{\mathbf{g}}_n^t - \hat{\mathbf{g}}_n^{t-1} \right]. \qquad (8)$$

The DSA initial update is given by applying the same substitution for the update of DGD in (2) as 

$$\mathbf{x}_n^{t+1} = \sum_{m=1}^{N} w_{nm} \mathbf{x}_m^t - \alpha \hat{\mathbf{g}}_n^t. \qquad (9)$$

DSA is summarized in Algorithm 1. For $t \geq 1$, the DSA update in (8) is implemented in Step 
9. This step requires access to the local iterates $\mathbf{x}_m^t$ of neighboring nodes $m \in \mathcal{N}_n$ which are 
collected in Step 2. Furthermore, implementation of the DSA update also requires access to the 
stochastic averaging gradients $\hat{\mathbf{g}}_n^{t-1}$ and $\hat{\mathbf{g}}_n^t$. The latter is computed in Step 4 and the former is 
computed and stored at the same step in the previous iteration. The computation of the stochastic 
averaging gradients requires the selection of the index $i_n^t$. This index is chosen uniformly at random 
in Step 3. Determination of stochastic averaging gradients also necessitates access and maintenance 
of the gradient table in Figure 1. The $i_n^t$ element of this table is updated in Step 5 by replacing 
$\nabla f_{n,i_n^t}(\mathbf{y}_{n,i_n^t}^t)$ with $\nabla f_{n,i_n^t}(\mathbf{x}_n^t)$, while the other vectors remain unchanged. To implement the first 
DSA iteration at time $t = 0$ we have to perform the update in (9) instead of the update in (8) as 
in Step 7. Further observe that the auxiliary variables $\mathbf{y}_{n,i}^0$ are initialized to the initial iterate $\mathbf{x}_n^0$. 
This implies that the initial values of the stored gradients are $\nabla f_{n,i}(\mathbf{y}_{n,i}^0) = \nabla f_{n,i}(\mathbf{x}_n^0)$, with a 
consequently relatively large initialization cost. 
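The steps of Algorithm 1 can be simulated for all nodes in a single process. The sketch below is a hedged illustration, not the paper's experimental code: the function `dsa`, the three-node fully connected network, the scalar quadratics $f_{n,i}(x) = (x - a_{n,i})^2/2$, the stepsize, and the iteration count are all assumptions. For brevity it recomputes the table average at every step instead of keeping a running sum.

```python
import numpy as np

def dsa(W, W_tilde, grad_fns, x0, alpha, num_iters, rng):
    """Simulate DSA, updates (8)-(9), for all N nodes; grad_fns[n][i] is the gradient of f_{n,i}."""
    N = len(grad_fns)
    X = np.array(x0, dtype=float)
    # Gradient tables initialized at y_{n,i}^0 = x_n^0
    tables = [np.stack([g(X[n]) for g in grad_fns[n]]) for n in range(N)]
    X_prev = G_prev = None
    for t in range(num_iters):
        G = np.empty_like(X)
        for n in range(N):
            i = rng.integers(len(grad_fns[n]))          # Step 3: uniform random index
            new = grad_fns[n][i](X[n])
            # Step 4: stochastic averaging gradient (7), using the table before the refresh
            G[n] = new - tables[n][i] + tables[n].mean(axis=0)
            tables[n][i] = new                          # Step 5: refresh one table entry
        if t == 0:
            X_next = W @ X - alpha * G                  # Step 7: initial update (9)
        else:
            X_next = X + W @ X - W_tilde @ X_prev - alpha * (G - G_prev)  # Step 9: update (8)
        X_prev, G_prev, X = X, G, X_next
    return X

# Hypothetical data: node n holds q_n = 4 scalars a_{n,i}; the optimum of (1) is A.mean()
A = np.array([[0.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0],
              [8.0, 9.0, 10.0, 13.0]])
grad_fns = [[(lambda x, a=a: x - a) for a in row] for row in A]
W = np.full((3, 3), 1.0 / 3.0)
W_tilde = (np.eye(3) + W) / 2
X = dsa(W, W_tilde, grad_fns, np.zeros((3, 1)), alpha=0.05, num_iters=4000,
        rng=np.random.default_rng(0))
print(X.ravel(), A.mean())
```

Each node evaluates one instantaneous gradient per iteration, yet the local iterates approach the global minimizer rather than a neighborhood of it.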

We point out that the weights $w_{nm}$ and $\tilde{w}_{nm}$ can't be arbitrary. If we define weight matrices $\mathbf{W}$ 
and $\tilde{\mathbf{W}}$ with elements $w_{nm}$ and $\tilde{w}_{nm}$, respectively, they have to satisfy conditions that we state as 
an assumption for future reference. 

Assumption 1 The weight matrices $\mathbf{W}$ and $\tilde{\mathbf{W}}$ must satisfy the following properties 

(a) Both are symmetric, $\mathbf{W} = \mathbf{W}^T$ and $\tilde{\mathbf{W}} = \tilde{\mathbf{W}}^T$. 

(b) The null space of $\mathbf{I} - \tilde{\mathbf{W}}$ includes the span of $\mathbf{1}$, i.e., $\mathrm{null}(\mathbf{I} - \tilde{\mathbf{W}}) \supseteq \mathrm{span}(\mathbf{1})$, the null space of 
$\mathbf{I} - \mathbf{W}$ is the span of $\mathbf{1}$, i.e., $\mathrm{null}(\mathbf{I} - \mathbf{W}) = \mathrm{span}(\mathbf{1})$, and the null space of the difference $\tilde{\mathbf{W}} - \mathbf{W}$ 
is the span of $\mathbf{1}$, i.e., $\mathrm{null}(\tilde{\mathbf{W}} - \mathbf{W}) = \mathrm{span}(\mathbf{1})$. 

(c) They satisfy the spectral ordering $\mathbf{W} \preceq \tilde{\mathbf{W}} \preceq (\mathbf{I} + \mathbf{W})/2$ and the matrix $\tilde{\mathbf{W}}$ is positive definite, 
$\mathbf{0} \prec \tilde{\mathbf{W}}$. 

Requiring the matrix $\mathbf{W}$ to be symmetric and with specific null space properties is necessary to 
let all agents converge to the same optimal variable. Analogous properties are necessary in DGD 
and are not difficult to satisfy. The condition on spectral ordering is specific to EXTRA but is not 
difficult to satisfy either. E.g., if we have a matrix $\mathbf{W}$ that satisfies all the conditions in Assumption 
1, the weight matrix $\tilde{\mathbf{W}} = (\mathbf{I} + \mathbf{W})/2$ makes Assumption 1 valid. 
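The conditions of Assumption 1 can be checked numerically for a candidate pair of weight matrices. The sketch below is an assumption-laden illustration: the helper `check_assumption_1`, its tolerances, and the five-node ring with uniform weights 1/3 are choices made for this example and do not come from the paper.

```python
import numpy as np

def check_assumption_1(W, W_tilde, tol=1e-10):
    """Numerically test properties (a)-(c) of Assumption 1 for symmetric inputs."""
    N = W.shape[0]
    I, ones = np.eye(N), np.ones(N)
    eig = np.linalg.eigvalsh                      # eigenvalues of a symmetric matrix

    def null_is_span_of_ones(M):
        # The vector 1 is in the null space and the null space is one-dimensional
        return bool(np.allclose(M @ ones, 0, atol=tol)
                    and np.sum(np.abs(eig(M)) < 1e-8) == 1)

    return {
        "symmetric": bool(np.allclose(W, W.T) and np.allclose(W_tilde, W_tilde.T)),
        "null_spaces": bool(np.allclose((I - W_tilde) @ ones, 0, atol=tol)
                            and null_is_span_of_ones(I - W)
                            and null_is_span_of_ones(W_tilde - W)),
        "spectral_order": bool(eig(W_tilde - W).min() > -tol
                               and eig((I + W) / 2 - W_tilde).min() > -tol
                               and eig(W_tilde).min() > tol),
    }

# Hypothetical ring of 5 nodes: weight 1/3 on each node and its two neighbors
N = 5
W = np.zeros((N, N))
for n in range(N):
    for m in (n - 1, n, n + 1):
        W[n, m % N] = 1.0 / 3.0
W_tilde = (np.eye(N) + W) / 2
checks = check_assumption_1(W, W_tilde)
print(checks)
```

Choosing $\tilde{\mathbf{W}} = \mathbf{W}$ instead would violate the null space condition on $\tilde{\mathbf{W}} - \mathbf{W}$, since the difference is then the zero matrix.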

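We also point out that, as written in (7), computation of local stochastic averaging gradients is 
costly because it requires evaluation of the sum $\sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{y}_{n,i}^t)$ at each iteration. This cost can 
be avoided by updating the sum at each iteration with the recursive formula 

$$\sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{y}_{n,i}^{t}) = \sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{y}_{n,i}^{t-1}) + \nabla f_{n,i_n^{t-1}}(\mathbf{x}_n^{t-1}) - \nabla f_{n,i_n^{t-1}}(\mathbf{y}_{n,i_n^{t-1}}^{t-1}). \qquad (10)$$

Important properties and interpretations of EXTRA and DSA are presented in the following sections 
after a pertinent remark. 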


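Remark 1 The local stochastic averaging gradients in (7) are unbiased estimates of the local 
gradients $\nabla f_n(\mathbf{x}_n^t)$. Indeed, if we let $\mathcal{F}^t$ measure the history of the system up until time $t$ we have that 
the sum in (7) is deterministic given this sigma-algebra. Thus, the conditional expectation of the 
stochastic averaging gradient is 

$$\mathbb{E}\left[ \hat{\mathbf{g}}_n^t \mid \mathcal{F}^t \right] = \mathbb{E}\left[ \nabla f_{n,i_n^t}(\mathbf{x}_n^t) \mid \mathcal{F}^t \right] - \mathbb{E}\left[ \nabla f_{n,i_n^t}(\mathbf{y}_{n,i_n^t}^t) \mid \mathcal{F}^t \right] + \frac{1}{q_n} \sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{y}_{n,i}^t). \qquad (11)$$

With the index $i_n^t$ chosen equiprobably from the set $\{1, \ldots, q_n\}$, the expectation of the second term 
in (11) is the same as the sum in the last term, since each of the indexes is chosen with probability $1/q_n$. 
Therefore, these two terms cancel out each other and, since the expectation of the first term in (11) 
is simply $\mathbb{E}[\nabla f_{n,i_n^t}(\mathbf{x}_n^t) \mid \mathcal{F}^t] = (1/q_n) \sum_{i=1}^{q_n} \nabla f_{n,i}(\mathbf{x}_n^t) = \nabla f_n(\mathbf{x}_n^t)$, we can simplify (11) to 

$$\mathbb{E}\left[ \hat{\mathbf{g}}_n^t \mid \mathcal{F}^t \right] = \nabla f_n(\mathbf{x}_n^t). \qquad (12)$$

The expression in (12) means, by definition, that $\hat{\mathbf{g}}_n^t$ is an unbiased estimate of $\nabla f_n(\mathbf{x}_n^t)$ when the 
history $\mathcal{F}^t$ is given. 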
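The unbiasedness in (12) can be verified by enumerating all possible indices instead of sampling. The numbers below are assumptions chosen for illustration: scalar functions $f_{n,i}(x) = c_i (x - a_i)^2 / 2$ with arbitrary stored points $y_{n,i}^t$; the variable names are hypothetical.

```python
import numpy as np

c = np.array([1.0, 2.0, 3.0, 4.0])        # hypothetical curvatures
a = np.array([1.0, 2.0, 3.0, 6.0])        # hypothetical minimizers
grad = lambda i, x: c[i] * (x - a[i])     # gradient of f_{n,i}(x) = 0.5 * c_i * (x - a_i)^2

x = 0.7                                    # current iterate x_n^t
y = np.array([0.0, -1.0, 2.5, 0.3])        # stored points y_{n,i}^t, arbitrary
q = len(c)
table_avg = np.mean([grad(i, y[i]) for i in range(q)])

# Stochastic averaging gradient (7) for every index the node could draw
g = np.array([grad(i, x) - grad(i, y[i]) + table_avg for i in range(q)])
expected_g = g.mean()                      # uniform sampling: probability 1/q per index
full_grad = np.mean([grad(i, x) for i in range(q)])
print(expected_g, full_grad)
```

The individual values in `g` differ from one index to another, but their average coincides with the full local gradient.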


2.1 Limit points of DGD and EXTRA 

The derivation of EXTRA hinges on the observation that the optimal argument of (1) is not a 
fixed point of the DGD iteration in (2) but is a fixed point of the iteration in (3). To explain this 
point define $\mathbf{x} := [\mathbf{x}_1; \ldots; \mathbf{x}_N] \in \mathbb{R}^{Np}$ as a vector that concatenates the local iterates $\mathbf{x}_n$ and the 
aggregate function $f : \mathbb{R}^{Np} \to \mathbb{R}$ as the one that takes values $f(\mathbf{x}) = f(\mathbf{x}_1, \ldots, \mathbf{x}_N) := \sum_{n=1}^{N} f_n(\mathbf{x}_n)$. 
Decentralized optimization entails the minimization of $f(\mathbf{x})$ subject to the constraint that all local 
variables are equal, 

$$\mathbf{x}^* := \operatorname{argmin} \ f(\mathbf{x}) = f(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{n=1}^{N} f_n(\mathbf{x}_n), \qquad \text{s.t. } \mathbf{x}_n = \mathbf{x}_m, \ \text{for all } n, m. \qquad (13)$$

The problems in (1) and (13) are equivalent in the sense that the vector $\mathbf{x}^* \in \mathbb{R}^{Np}$ is a solution of 
(13) if its blocks satisfy $\mathbf{x}_n^* = \tilde{\mathbf{x}}^*$ for all $n$, where $\tilde{\mathbf{x}}^* \in \mathbb{R}^p$ denotes the solution of (1), or, 
equivalently, if we can write $\mathbf{x}^* = [\tilde{\mathbf{x}}^*; \ldots; \tilde{\mathbf{x}}^*]$. Regardless 
of interpretation, the Karush, Kuhn, Tucker (KKT) conditions of (13) dictate that the optimal 
argument $\mathbf{x}^*$ must satisfy 

$$\mathbf{x}^* \in \mathrm{span}(\mathbf{1}_N \otimes \mathbf{I}_p), \qquad (\mathbf{1}_N \otimes \mathbf{I}_p)^T \nabla f(\mathbf{x}^*) = \mathbf{0}. \qquad (14)$$

The first condition in (14) requires that all the local variables $\mathbf{x}_n^*$ be equal, while the second condition 
requires the sum of local gradients to vanish at the optimal point. This latter condition is not the 
same as $\nabla f(\mathbf{x}) = \mathbf{0}$. If we observe that the gradient $\nabla f(\mathbf{x})$ of the aggregate function can be written 
as $\nabla f(\mathbf{x}) = [\nabla f_1(\mathbf{x}_1); \ldots; \nabla f_N(\mathbf{x}_N)] \in \mathbb{R}^{Np}$, the condition $\nabla f(\mathbf{x}) = \mathbf{0}$ implies that all the local 
gradients are null, i.e., that $\nabla f_n(\mathbf{x}_n) = \mathbf{0}$ for all $n$. This is stronger than having their sum being 
null as required by (14). 

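Define now the extended weight matrices as the Kronecker products $\mathbf{Z} := \mathbf{W} \otimes \mathbf{I} \in \mathbb{R}^{Np \times Np}$ 
and $\tilde{\mathbf{Z}} := \tilde{\mathbf{W}} \otimes \mathbf{I} \in \mathbb{R}^{Np \times Np}$. Note that the required conditions for the weight matrices $\mathbf{W}$ and 
$\tilde{\mathbf{W}}$ in Assumption 1 enforce some conditions on the extended weight matrices $\mathbf{Z}$ and $\tilde{\mathbf{Z}}$. Based on 
Assumption 1(a), the matrices $\mathbf{Z}$ and $\tilde{\mathbf{Z}}$ are also symmetric, i.e., $\mathbf{Z} = \mathbf{Z}^T$ and $\tilde{\mathbf{Z}} = \tilde{\mathbf{Z}}^T$. Conditions in 
Assumption 1(b) imply that $\mathrm{null}\{\tilde{\mathbf{Z}} - \mathbf{Z}\} = \mathrm{span}\{\mathbf{1} \otimes \mathbf{I}\}$, $\mathrm{null}\{\mathbf{I} - \mathbf{Z}\} = \mathrm{span}\{\mathbf{1} \otimes \mathbf{I}\}$, and $\mathrm{null}\{\mathbf{I} - \tilde{\mathbf{Z}}\} \supseteq 
\mathrm{span}\{\mathbf{1} \otimes \mathbf{I}\}$. Lastly, the spectral properties of matrices $\mathbf{W}$ and $\tilde{\mathbf{W}}$ in Assumption 1(c) yield that 
matrix $\tilde{\mathbf{Z}}$ is positive definite and the expression $\mathbf{Z} \preceq \tilde{\mathbf{Z}} \preceq (\mathbf{I} + \mathbf{Z})/2$ holds. 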

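According to the definition of extended weight matrix $\mathbf{Z}$, the DGD iteration in (2) is equivalent 
to 

$$\mathbf{x}^{t+1} = \mathbf{Z} \mathbf{x}^t - \alpha \nabla f(\mathbf{x}^t), \qquad (15)$$

where, according to the definition of the aggregate function, the gradient $\nabla f(\mathbf{x}^t)$ can be written as $\nabla f(\mathbf{x}^t) = 
[\nabla f_1(\mathbf{x}_1^t); \ldots; \nabla f_N(\mathbf{x}_N^t)] \in \mathbb{R}^{Np}$. Likewise, the EXTRA iteration in (3) can be written as 

$$\mathbf{x}^{t+1} = (\mathbf{I} + \mathbf{Z}) \mathbf{x}^t - \tilde{\mathbf{Z}} \mathbf{x}^{t-1} - \alpha \left[ \nabla f(\mathbf{x}^t) - \nabla f(\mathbf{x}^{t-1}) \right]. \qquad (16)$$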
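The equivalence between the node-wise weighted sums and the stacked form with the Kronecker product can be checked directly. The sizes and the random symmetric matrix below are assumptions used only to illustrate the identity; `np.kron` is NumPy's Kronecker product.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 4, 3                                # hypothetical network size and dimension
W = rng.random((N, N))
W = (W + W.T) / 2                          # any symmetric matrix suffices for this identity
X = rng.standard_normal((N, p))            # row n holds the local variable x_n

Z = np.kron(W, np.eye(p))                  # extended weight matrix Z = W (x) I_p
x = X.reshape(-1)                          # stacked vector [x_1; ...; x_N]

node_wise = W @ X                          # row n equals sum_m w_nm x_m
stacked = Z @ x                            # same quantity in stacked form
print(np.allclose(stacked, node_wise.reshape(-1)))
```

Storing the iterates as an N-by-p array and multiplying by the N-by-N matrix avoids forming the Np-by-Np matrix explicitly.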

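The fundamental difference between DGD and EXTRA is that a fixed point of (15) does not 
necessarily satisfy (14), whereas the fixed points of (16) are guaranteed to do so. Indeed, taking limits 
in (15) we see that the fixed points $\mathbf{x}^\infty$ of DGD must satisfy 

$$(\mathbf{I} - \mathbf{Z}) \mathbf{x}^\infty + \alpha \nabla f(\mathbf{x}^\infty) = \mathbf{0}, \qquad (17)$$

which is incompatible with (14) except in peculiar circumstances, such as, e.g., when all local 
functions have the same minimum. The limit points of EXTRA, however, satisfy the relationship 

$$\mathbf{x}^\infty - \mathbf{x}^\infty = (\mathbf{Z} - \tilde{\mathbf{Z}}) \mathbf{x}^\infty - \alpha \left[ \nabla f(\mathbf{x}^\infty) - \nabla f(\mathbf{x}^\infty) \right]. \qquad (18)$$

Canceling out the variables on the left hand side and the gradients on the right hand side it follows 
that $(\tilde{\mathbf{Z}} - \mathbf{Z}) \mathbf{x}^\infty = \mathbf{0}$. Since the null space of $\tilde{\mathbf{Z}} - \mathbf{Z}$ is $\mathrm{null}(\tilde{\mathbf{Z}} - \mathbf{Z}) = \mathrm{span}(\mathbf{1}_N \otimes \mathbf{I}_p)$ by assumption, we 
must have $\mathbf{x}^\infty \in \mathrm{span}(\mathbf{1}_N \otimes \mathbf{I}_p)$. This is the first condition in (14). For the second condition in (14) 
sum the updates in (16) recursively and use the telescopic nature of the sum to write 

$$\mathbf{x}^{t+1} = \tilde{\mathbf{Z}} \mathbf{x}^t - \alpha \nabla f(\mathbf{x}^t) - \sum_{s=0}^{t} (\tilde{\mathbf{Z}} - \mathbf{Z}) \mathbf{x}^s. \qquad (19)$$

Substituting the limit point in (19) and reordering terms, we see that $\mathbf{x}^\infty$ must satisfy 

$$\alpha \nabla f(\mathbf{x}^\infty) = -(\mathbf{I} - \tilde{\mathbf{Z}}) \mathbf{x}^\infty - \sum_{s=0}^{\infty} (\tilde{\mathbf{Z}} - \mathbf{Z}) \mathbf{x}^s. \qquad (20)$$

In (20) we have that $(\mathbf{I} - \tilde{\mathbf{Z}}) \mathbf{x}^\infty = \mathbf{0}$ because the null space of $\mathbf{I} - \tilde{\mathbf{Z}}$ contains $\mathrm{span}(\mathbf{1}_N \otimes \mathbf{I}_p)$ 
by assumption and $\mathbf{x}^\infty \in \mathrm{span}(\mathbf{1}_N \otimes \mathbf{I}_p)$ as already shown. Implementing this simplification and 
considering the multiplication of the resulting equality by $(\mathbf{1}_N \otimes \mathbf{I}_p)^T$ we obtain 

$$(\mathbf{1}_N \otimes \mathbf{I}_p)^T \alpha \nabla f(\mathbf{x}^\infty) = - \sum_{s=0}^{\infty} (\mathbf{1}_N \otimes \mathbf{I}_p)^T (\tilde{\mathbf{Z}} - \mathbf{Z}) \mathbf{x}^s. \qquad (21)$$

In (21), the terms $(\mathbf{1}_N \otimes \mathbf{I}_p)^T (\tilde{\mathbf{Z}} - \mathbf{Z}) = \mathbf{0}$ because the matrices $\tilde{\mathbf{Z}}$ and $\mathbf{Z}$ are symmetric and $(\mathbf{1}_N \otimes \mathbf{I}_p)$ 
is in the null space of the difference $\tilde{\mathbf{Z}} - \mathbf{Z}$. This implies that $(\mathbf{1}_N \otimes \mathbf{I}_p)^T \alpha \nabla f(\mathbf{x}^\infty) = \mathbf{0}$, which is the 
second condition in (14). Therefore, given the assumption that the sequence of EXTRA iterates $\mathbf{x}^t$ 
has a limit point $\mathbf{x}^\infty$ it follows that this limit point satisfies both conditions in (14) and for this 
reason exact convergence with constant stepsize is achievable for EXTRA. 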


2.2 Stochastic saddle point method interpretation of DSA 


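The convergence proofs of DSA build on a reinterpretation of EXTRA as a saddle point method. 
To introduce this primal-dual interpretation consider the update in (19) and define the sequence of 
vectors $\mathbf{v}^t = \sum_{s=0}^{t} (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x}^s$. The vector $\mathbf{v}^t$ represents the accumulation of variable dissimilarities 
in different nodes over time. Considering this definition of $\mathbf{v}^t$ we can rewrite (19) as 

$$\mathbf{x}^{t+1} = \mathbf{x}^t - \alpha \left[ \nabla f(\mathbf{x}^t) + \frac{1}{\alpha} (\mathbf{I} - \tilde{\mathbf{Z}}) \mathbf{x}^t + \frac{1}{\alpha} (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{v}^t \right]. \qquad (22)$$

Furthermore, based on the definition of the sequence $\mathbf{v}^t = \sum_{s=0}^{t} (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x}^s$ we can write the 
recursive expression 

$$\mathbf{v}^{t+1} = \mathbf{v}^t + \alpha \left[ \frac{1}{\alpha} (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x}^{t+1} \right]. \qquad (23)$$

Consider $\mathbf{x}$ as a primal variable and $\mathbf{v}$ as a dual variable. Then, the updates in (22) and (23) are 
equivalent to the updates of a saddle point method with stepsize $\alpha$ that solves for the critical points 
of the augmented Lagrangian 

$$\mathcal{L}(\mathbf{x}, \mathbf{v}) = f(\mathbf{x}) + \frac{1}{\alpha} \mathbf{v}^T (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x} + \frac{1}{2\alpha} \mathbf{x}^T (\mathbf{I} - \tilde{\mathbf{Z}}) \mathbf{x}. \qquad (24)$$

In the Lagrangian in (24) the factor $(1/\alpha) \mathbf{v}^T (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x}$ stems from the linear constraint 
$(\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x} = \mathbf{0}$ and the quadratic term $(1/2\alpha) \mathbf{x}^T (\mathbf{I} - \tilde{\mathbf{Z}}) \mathbf{x}$ is the augmented term added to the Lagrangian. 
Therefore, the optimization problem whose augmented Lagrangian is the one given in (24) is 

$$\mathbf{x}^* = \operatorname*{argmin}_{\mathbf{x}} \ f(\mathbf{x}) \qquad \text{s.t. } \frac{1}{\alpha} (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x} = \mathbf{0}. \qquad (25)$$

Observing that the null space of $(\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2}$ is $\mathrm{null}((\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2}) = \mathrm{null}(\tilde{\mathbf{Z}} - \mathbf{Z}) = \mathrm{span}\{\mathbf{1}_N \otimes \mathbf{I}_p\}$, 
the constraint in (25) is equivalent to the consensus constraint $\mathbf{x}_n = \mathbf{x}_m$ for all $n, m$ that appears 
in (13). This means that (25) is equivalent to (13), which, as already argued, is equivalent to the 
original problem in (1). Hence, EXTRA is a saddle point method that solves (25) which, because 
of their equivalence, is tantamount to solving (1). Considering that saddle point methods converge 
linearly, it follows that the same is true of EXTRA. 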

That EXTRA is a saddle point method provides a simple explanation of its convergence 
properties. For the purposes of this paper, however, the important fact is that if EXTRA is a 
saddle point method, DSA is a stochastic saddle point method. To write DSA in this form define 
$\hat{\mathbf{g}}^t := [\hat{\mathbf{g}}_1^t; \ldots; \hat{\mathbf{g}}_N^t] \in \mathbb{R}^{Np}$ as the vector that concatenates all the local stochastic averaging gradients 
at step $t$. Then, the DSA update in (8) can be written as 

$$\mathbf{x}^{t+1} = (\mathbf{I} + \mathbf{Z}) \mathbf{x}^t - \tilde{\mathbf{Z}} \mathbf{x}^{t-1} - \alpha \left[ \hat{\mathbf{g}}^t - \hat{\mathbf{g}}^{t-1} \right]. \qquad (26)$$

Comparing (16) and (26) we see that they differ in the latter using stochastic averaging gradients 
$\hat{\mathbf{g}}^t$ in lieu of the full gradients $\nabla f(\mathbf{x}^t)$. Therefore, DSA is a stochastic saddle point method in which 
the primal variables are updated as 

$$\mathbf{x}^{t+1} = \mathbf{x}^t - \alpha \hat{\mathbf{g}}^t - (\mathbf{I} - \tilde{\mathbf{Z}}) \mathbf{x}^t - (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{v}^t, \qquad (27)$$

and the dual variables $\mathbf{v}^t$ are updated as 

$$\mathbf{v}^{t+1} = \mathbf{v}^t + (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x}^{t+1}. \qquad (28)$$

Notice that the initial primal variable $\mathbf{x}^0$ is an arbitrary vector in $\mathbb{R}^{Np}$ while, according to the 
definition $\mathbf{v}^t = \sum_{s=0}^{t} (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x}^s$, we need to set the initial multiplier to $\mathbf{v}^0 = (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2} \mathbf{x}^0$. 
This is not a problem in practice because (27) and (28) are not used for implementation. In our 
convergence analysis we utilize the (equivalent) stochastic saddle point expressions for DSA shown 
in (27) and (28). The expression in (8) is used for implementation because it avoids exchanging 
dual variables, as well as the initialization problem. The convergence analysis is presented in the 
following section. 
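The equivalence between the implementation form (26) and the primal-dual form (27)-(28) is purely algebraic, so it can be checked numerically with any gradient sequence. In the sketch below the network, the scalar quadratics, and the pre-drawn noise that stands in for the stochastic averaging gradients are all assumptions; the matrix square root is computed through an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, T = 3, 0.1, 20
W = np.full((N, N), 1.0 / N)
W_tilde = (np.eye(N) + W) / 2              # p = 1, so Z = W and Z_tilde = W_tilde
b = np.array([0.0, 1.0, 5.0])
noise = 0.1 * rng.standard_normal((T, N))
g = lambda x, t: (x - b) + noise[t]        # stand-in for the stochastic gradient at step t

# U = (Z_tilde - Z)^{1/2} from the eigendecomposition of a symmetric PSD matrix
lam, V = np.linalg.eigh(W_tilde - W)
U = V @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ V.T

x0 = rng.standard_normal(N)

# Implementation form: initial step (9), then recursion (26)
xs_a = [x0, W @ x0 - alpha * g(x0, 0)]
for t in range(1, T):
    xs_a.append((np.eye(N) + W) @ xs_a[t] - W_tilde @ xs_a[t - 1]
                - alpha * (g(xs_a[t], t) - g(xs_a[t - 1], t - 1)))

# Primal-dual form (27)-(28) with initial multiplier v^0 = U x^0
x, v = x0, U @ x0
xs_b = [x0]
for t in range(T):
    x = x - alpha * g(x, t) - (np.eye(N) - W_tilde) @ x - U @ v
    v = v + U @ x
    xs_b.append(x)

print(np.max(np.abs(np.array(xs_a) - np.array(xs_b))))
```

Both recursions produce the same iterates up to floating point error, while only the first one avoids storing and exchanging the dual variables.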


3. Convergence analysis 

Our goal here is to show that as time progresses the sequence of iterates $\mathbf{x}^t$ approaches the optimal 
argument $\mathbf{x}^*$. To do so, in addition to the conditions on the weight matrices $\mathbf{W}$ and $\tilde{\mathbf{W}}$ in Assumption 
1, we assume the instantaneous local functions $f_{n,i}$ have specific properties that we state next. 

Assumption 2 The instantaneous local functions $f_{n,i}(\mathbf{x}_n)$ are differentiable and strongly convex 
with parameter $\mu$. 

Assumption 3 The gradients of the instantaneous local functions $\nabla f_{n,i}$ are Lipschitz continuous with 
parameter $L$. I.e., for all $n \in \{1, \ldots, N\}$ and $i \in \{1, \ldots, q_n\}$ we can write 

$$\| \nabla f_{n,i}(\mathbf{a}) - \nabla f_{n,i}(\mathbf{b}) \| \leq L \, \| \mathbf{a} - \mathbf{b} \|, \qquad \mathbf{a}, \mathbf{b} \in \mathbb{R}^p. \qquad (29)$$

The condition imposed by Assumption 2 implies that the local functions $f_n(\mathbf{x}_n)$ and the global 
cost function $f(\mathbf{x}) = \sum_{n=1}^{N} f_n(\mathbf{x}_n)$ are also strongly convex with parameter $\mu$. Likewise, Lipschitz 
continuity of the local instantaneous gradients considered in Assumption 3 enforces Lipschitz 
continuity of the gradients of the local functions $\nabla f_n(\mathbf{x}_n)$ and the aggregate function $\nabla f(\mathbf{x})$; see, e.g., 
Lemma 1 of Mokhtari et al. (2015a). 


3.1 Preliminaries 

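In this section we study some basic properties of the sequences of primal and dual variables generated 
by the DSA algorithm. In the following lemma, we study the relation of iterates $\mathbf{x}^t$ and $\mathbf{v}^t$ with the 
optimal primal $\mathbf{x}^*$ and dual $\mathbf{v}^*$ arguments. 

Lemma 2 Consider the DSA algorithm as defined in (6)-(9) and recall the updates of the primal $\mathbf{x}^t$ 
and dual $\mathbf{v}^t$ variables in (27) and (28), respectively. Further, define the positive semidefinite matrix 
$\mathbf{U} := (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2}$. If Assumption 1 holds true, then the sequence of primal $\mathbf{x}^t$ and dual $\mathbf{v}^t$ variables 
satisfy 

$$\alpha \left[ \hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*) \right] = (\mathbf{I} + \mathbf{Z} - 2\tilde{\mathbf{Z}})(\mathbf{x}^* - \mathbf{x}^{t+1}) + \tilde{\mathbf{Z}}(\mathbf{x}^t - \mathbf{x}^{t+1}) - \mathbf{U}(\mathbf{v}^{t+1} - \mathbf{v}^*). \qquad (30)$$

Proof Considering the update rule for the dual variable in (28) and the definition $\mathbf{U} = (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2}$, 
we can substitute $\mathbf{U} \mathbf{v}^t$ in (27) by $\mathbf{U} \mathbf{v}^{t+1} - \mathbf{U}^2 \mathbf{x}^{t+1}$. Applying this substitution into the DSA primal 
update in (27) yields 

$$\alpha \hat{\mathbf{g}}^t = -(\mathbf{I} + \mathbf{Z} - \tilde{\mathbf{Z}}) \mathbf{x}^{t+1} + \tilde{\mathbf{Z}} \mathbf{x}^t - \mathbf{U} \mathbf{v}^{t+1}. \qquad (31)$$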


Mokhtari and Ribeiro 


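By adding and subtracting $\tilde{\mathbf{Z}} \mathbf{x}^{t+1}$ on the right hand side of (31) and considering the fact that 
$(\mathbf{I} + \mathbf{Z} - 2\tilde{\mathbf{Z}}) \mathbf{x}^* = \mathbf{0}$ we obtain 

$$\alpha \hat{\mathbf{g}}^t = (\mathbf{I} + \mathbf{Z} - 2\tilde{\mathbf{Z}})(\mathbf{x}^* - \mathbf{x}^{t+1}) + \tilde{\mathbf{Z}}(\mathbf{x}^t - \mathbf{x}^{t+1}) - \mathbf{U} \mathbf{v}^{t+1}. \qquad (32)$$

One of the KKT conditions of problem (25) implies that the optimal variables $\mathbf{x}^*$ and $\mathbf{v}^*$ satisfy 
$\alpha \nabla f(\mathbf{x}^*) + \mathbf{U} \mathbf{v}^* = \mathbf{0}$ or equivalently $-\alpha \nabla f(\mathbf{x}^*) = \mathbf{U} \mathbf{v}^*$. Adding this equality to both sides of (32) 
yields the claim in (30). ■ 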


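In the subsequent analyses of convergence of DSA, we need an upper bound for the expected 
value of the squared difference between the stochastic averaging gradient $\hat{\mathbf{g}}^t$ and the gradient at the optimal 
argument $\nabla f(\mathbf{x}^*)$ given the observation until step $t$, i.e., $\mathbb{E}[\|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\|^2 \mid \mathcal{F}^t]$. To establish this 
upper bound first we define the sequence $p^t \in \mathbb{R}$ as 

$$p^t := \sum_{n=1}^{N} \frac{1}{q_n} \sum_{i=1}^{q_n} \left( f_{n,i}(\mathbf{y}_{n,i}^t) - f_{n,i}(\tilde{\mathbf{x}}^*) - \nabla f_{n,i}(\tilde{\mathbf{x}}^*)^T (\mathbf{y}_{n,i}^t - \tilde{\mathbf{x}}^*) \right), \qquad (33)$$

where $\tilde{\mathbf{x}}^* \in \mathbb{R}^p$ denotes the solution of (1). Notice that based on strong convexity of local 
instantaneous functions $f_{n,i}$, each term $f_{n,i}(\mathbf{y}_{n,i}^t) - f_{n,i}(\tilde{\mathbf{x}}^*) - \nabla f_{n,i}(\tilde{\mathbf{x}}^*)^T (\mathbf{y}_{n,i}^t - \tilde{\mathbf{x}}^*)$ 
is nonnegative and as a result the sequence $p^t$ defined in (33) is always 
nonnegative. In the following lemma, we use the result in Lemma 2 to guarantee an upper bound for 
the expectation $\mathbb{E}[\|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\|^2 \mid \mathcal{F}^t]$ in terms of $p^t$ and the optimality gap $f(\mathbf{x}^t) - f(\mathbf{x}^*) - 
\nabla f(\mathbf{x}^*)^T (\mathbf{x}^t - \mathbf{x}^*)$. 

Lemma 3 Consider the DSA algorithm in (6)-(9) and the definition of sequence $p^t$ in (33). If 
Assumptions 1-3 hold true, then the squared norm of the difference between stochastic averaging 
gradient $\hat{\mathbf{g}}^t$ and the optimal gradient $\nabla f(\mathbf{x}^*)$ in expectation is bounded above by 

$$\mathbb{E}\left[ \|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\|^2 \mid \mathcal{F}^t \right] \leq 4 L p^t + 2 (2L - \mu) \left( f(\mathbf{x}^t) - f(\mathbf{x}^*) - \nabla f(\mathbf{x}^*)^T (\mathbf{x}^t - \mathbf{x}^*) \right). \qquad (34)$$

Proof See Appendix A. 

Observe that as the sequence of iterates $\mathbf{x}^t$ approaches the optimal argument $\mathbf{x}^*$, all the local 
auxiliary variables $\mathbf{y}_{n,i}^t$ converge to $\tilde{\mathbf{x}}^*$, which implies convergence of $p^t$ to null. This observation 
in association with the result in (34) implies that the expected value of the difference between the 
stochastic averaging gradient $\hat{\mathbf{g}}^t$ and the optimal gradient $\nabla f(\mathbf{x}^*)$ vanishes as the sequence of iterates 
$\mathbf{x}^t$ approaches the optimal argument $\mathbf{x}^*$. 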


3.2 Convergence 

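In this section we establish linear convergence of the sequence of iterates $\mathbf{x}^t$ generated by DSA to 
the optimal argument $\mathbf{x}^*$. To do so define $0 < \gamma$ and $\Gamma < \infty$ as the smallest and largest eigenvalues 
of the positive definite matrix $\tilde{\mathbf{Z}}$, respectively. Likewise, define $\gamma'$ as the smallest non-zero 
eigenvalue of matrix $\tilde{\mathbf{Z}} - \mathbf{Z}$ and $\Gamma'$ as the largest eigenvalue of matrix $\tilde{\mathbf{Z}} - \mathbf{Z}$. Further, define vectors 
$\mathbf{u}^t, \mathbf{u}^* \in \mathbb{R}^{2Np}$ and matrix $\mathbf{G} \in \mathbb{R}^{2Np \times 2Np}$ as 

$$\mathbf{u}^* := \begin{bmatrix} \mathbf{x}^* \\ \mathbf{v}^* \end{bmatrix}, \qquad \mathbf{u}^t := \begin{bmatrix} \mathbf{x}^t \\ \mathbf{v}^t \end{bmatrix}, \qquad \mathbf{G} = \begin{bmatrix} \tilde{\mathbf{Z}} & \mathbf{0} \\ \mathbf{0} & \mathbf{I} \end{bmatrix}. \qquad (35)$$

Vector $\mathbf{u}^* \in \mathbb{R}^{2Np}$ concatenates the optimal primal and dual variables and vector $\mathbf{u}^t \in \mathbb{R}^{2Np}$ contains 
primal and dual iterates at step $t$. Matrix $\mathbf{G} \in \mathbb{R}^{2Np \times 2Np}$ is a block diagonal positive definite matrix 
that we introduce since instead of tracking the value of the $\ell_2$ norm $\|\mathbf{u}^t - \mathbf{u}^*\|_2^2$ we study the convergence 
properties of the $\mathbf{G}$ weighted norm $\|\mathbf{u}^t - \mathbf{u}^*\|_{\mathbf{G}}^2$. Notice that the weighted norm $\|\mathbf{u}^t - \mathbf{u}^*\|_{\mathbf{G}}^2$ is equivalent 
to $(\mathbf{u}^t - \mathbf{u}^*)^T \mathbf{G} (\mathbf{u}^t - \mathbf{u}^*)$. Our goal is to show that the sequence $\|\mathbf{u}^t - \mathbf{u}^*\|_{\mathbf{G}}^2$ converges linearly to 
null. To do this we show linear convergence of a Lyapunov function of the sequence $\|\mathbf{u}^t - \mathbf{u}^*\|_{\mathbf{G}}^2$. 
The Lyapunov function is defined as $\|\mathbf{u}^t - \mathbf{u}^*\|_{\mathbf{G}}^2 + c \, p^t$ where $c > 0$ is a positive constant. 

To prove linear convergence of the sequence $\|\mathbf{u}^t - \mathbf{u}^*\|_{\mathbf{G}}^2 + c \, p^t$ we first show an upper bound for 
the expected error $\mathbb{E}[\|\mathbf{u}^{t+1} - \mathbf{u}^*\|_{\mathbf{G}}^2 \mid \mathcal{F}^t]$ in terms of $\|\mathbf{u}^t - \mathbf{u}^*\|_{\mathbf{G}}^2$ and some parameters that capture 
the optimality gap. 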


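Lemma 4 Consider the DSA algorithm as defined in (6)-(9). Further recall the definitions of $p^t$ in (33) and $u^t$, $u^*$, and $G$ in (35). If Assumptions 1-3 hold true, then for any positive constant $\eta > 0$ we can write

$$\begin{aligned}
\mathbb{E}\left[\|u^{t+1} - u^*\|_G^2 \mid \mathcal{F}^t\right] \le{}& \|u^t - u^*\|_G^2 - 2\,\mathbb{E}\left[\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} \mid \mathcal{F}^t\right] + \frac{4\alpha L}{\eta}\, p^t \\
& - \mathbb{E}\left[\|x^{t+1} - x^t\|^2_{\tilde{Z} - 2\alpha\eta I} \mid \mathcal{F}^t\right] - \mathbb{E}\left[\|v^{t+1} - v^t\|^2 \mid \mathcal{F}^t\right] \\
& - \left( \frac{4\alpha\mu}{L} - \frac{2\alpha(2L - \mu)}{\eta} \right) \left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right).
\end{aligned} \tag{36}$$

Proof See Appendix B. ■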


Lemma 4 shows an upper bound for the expected squared norm $\mathbb{E}[\|u^{t+1} - u^*\|_G^2 \mid \mathcal{F}^t]$, which is the first part of the Lyapunov function $\|u^t - u^*\|_G^2 + c\,p^t$ at step $t + 1$. Likewise, we provide an upper bound for the second term of the Lyapunov function at time $t + 1$, which is in terms of $p^t$ and some parameters that capture the optimality gap. This bound is studied in the following lemma.

Lemma 5 Consider the DSA algorithm as defined in (6)-(9) and the definition of $p^t$ in (33). Further, define $q_{\min}$ and $q_{\max}$ as the smallest and largest values for the number of instantaneous functions at a node, respectively. If Assumptions 1-3 hold true, then for all $t > 0$ the sequence $p^t$ satisfies

$$\mathbb{E}\left[p^{t+1} \mid \mathcal{F}^t\right] \le \left(1 - \frac{1}{q_{\max}}\right) p^t + \frac{1}{q_{\min}} \left[ f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right]. \tag{37}$$

Proof See Appendix C. ■


Lemma 5 provides an upper bound for $\mathbb{E}[p^{t+1} \mid \mathcal{F}^t]$ in terms of its previous value $p^t$ and the optimality error $f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*)$. Combining the results in Lemmata 4 and 5 we can show that in expectation the Lyapunov function $\|u^{t+1} - u^*\|_G^2 + c\,p^{t+1}$ at step $t + 1$ is strictly smaller than its previous value $\|u^t - u^*\|_G^2 + c\,p^t$ at step $t$.
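As a brief worked step (a sketch that reads (36) and (37) as they are derived in Appendices B and C, not a replacement for the full proof in Appendix D), adding $c$ times (37) to (36) and grouping terms gives:

```latex
\begin{aligned}
\mathbb{E}\big[\|u^{t+1}-u^*\|_G^2 + c\,p^{t+1} \mid \mathcal{F}^t\big]
\le{}& \|u^t-u^*\|_G^2 + c\,p^t
 - \Big(\frac{c}{q_{\max}} - \frac{4\alpha L}{\eta}\Big)\, p^t \\
& - \Big(\frac{4\alpha\mu}{L} - \frac{2\alpha(2L-\mu)}{\eta} - \frac{c}{q_{\min}}\Big)
   \big(f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t-x^*)\big) \\
& - 2\,\mathbb{E}\big[\|x^{t+1}-x^*\|^2_{I+Z-2\tilde{Z}} \mid \mathcal{F}^t\big]
 - \mathbb{E}\big[\|x^{t+1}-x^t\|^2_{\tilde{Z}-2\alpha\eta I} \mid \mathcal{F}^t\big]
 - \mathbb{E}\big[\|v^{t+1}-v^t\|^2 \mid \mathcal{F}^t\big].
\end{aligned}
```

Both parenthesized coefficients are positive when $4\alpha L q_{\max}/\eta < c < 4\alpha\mu q_{\min}/L - 2\alpha q_{\min}(2L-\mu)/\eta$, and the matrix $\tilde{Z} - 2\alpha\eta I$ is positive definite when $\alpha < \gamma/(2\eta)$. Under these conditions every subtracted term is nonnegative, which is what makes a strict decrease of the Lyapunov function possible.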


Theorem 6 Consider the DSA algorithm as defined in (6)-(9). Further recall the definition of the sequence $p^t$ in (33). Define $\eta$ as an arbitrary positive constant chosen from the interval

$$\eta \in \left( \frac{L^2 q_{\max}}{\mu\, q_{\min}} + \frac{L^2}{\mu} - \frac{L}{2},\ \infty \right). \tag{38}$$

If Assumptions 1-3 hold true and the stepsize $\alpha$ is chosen from the interval $\alpha \in (0, \gamma/(2\eta))$, then for arbitrary $c$ chosen from the interval

$$c \in \left( \frac{4\alpha L q_{\max}}{\eta},\ \frac{4\alpha\mu q_{\min}}{L} - \frac{2\alpha q_{\min}(2L - \mu)}{\eta} \right), \tag{39}$$

there exists a positive constant $0 < \delta < 1$ such that

$$\mathbb{E}\left[\|u^{t+1} - u^*\|_G^2 + c\,p^{t+1} \mid \mathcal{F}^t\right] \le (1 - \delta)\left( \|u^t - u^*\|_G^2 + c\,p^t \right). \tag{40}$$

Proof See Appendix D. ■


We point out that the linear convergence constant $\delta$ in (40) is explicitly available; see (99) in Appendix D. It is a function of the strong convexity parameter $\mu$, the Lipschitz continuity constant $L$, lower and upper bounds on the eigenvalues of the matrices $\tilde{Z}$, $\tilde{Z} - Z$, and $I + Z - 2\tilde{Z}$, the smallest $q_{\min}$ and largest $q_{\max}$ values for the number of instantaneous functions available at a node, and the stepsize $\alpha$. Insight on the dependence of $\delta$ on problem parameters is offered in Section 3.3.

The inequality in (40) shows that the expected value of the sequence $\|u^t - u^*\|_G^2 + c\,p^t$ at time $t + 1$ given the observations until step $t$ is strictly smaller than the previous iterate at step $t$. Computing the expected value with respect to the initial sigma field, $\mathbb{E}[\,\cdot \mid \mathcal{F}^0] = \mathbb{E}[\,\cdot\,]$, implies that in expectation the sequence $\|u^t - u^*\|_G^2 + c\,p^t$ converges linearly to null, i.e.,

$$\mathbb{E}\left[\|u^t - u^*\|_G^2 + c\,p^t\right] \le (1 - \delta)^t \left( \|u^0 - u^*\|_G^2 + c\,p^0 \right). \tag{41}$$

We use the result in (41) to establish linear convergence of the sequence of squared norm error $\|x^t - x^*\|^2$ in expectation.
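The passage from the one-step bound (40) to the unrolled bound (41) rests on the tower property of conditional expectation. A short sketch, writing $V^t := \|u^t - u^*\|_G^2 + c\,p^t$ as shorthand introduced here for illustration only:

```latex
\mathbb{E}\big[V^{t+1}\big]
  = \mathbb{E}\Big[\mathbb{E}\big[V^{t+1} \mid \mathcal{F}^t\big]\Big]
  \le (1-\delta)\,\mathbb{E}\big[V^{t}\big]
  \le (1-\delta)^2\,\mathbb{E}\big[V^{t-1}\big]
  \le \dots
  \le (1-\delta)^{t+1}\, V^{0}.
```

The last step uses that $V^0$ is deterministic given the initialization, so $\mathbb{E}[V^0] = V^0$.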


Corollary 7 Consider the DSA algorithm as defined in (6)-(9) and recall that $\gamma$ is the minimum eigenvalue of the positive definite matrix $\tilde{Z}$. If the hypotheses of Theorem 6 hold, then there exists a positive constant $0 < \delta < 1$ such that

$$\mathbb{E}\left[\|x^t - x^*\|^2\right] \le (1 - \delta)^t\, \frac{1}{\gamma} \left( \|u^0 - u^*\|_G^2 + c\,p^0 \right). \tag{42}$$

Proof First note that according to the definitions of $u$ and $G$ in (35) and the definition of $p^t$ in (33), we can write $\|x^t - x^*\|_{\tilde{Z}}^2 \le \|u^t - u^*\|_G^2 + c\,p^t$. Further, note that the weighted norm $\|x^t - x^*\|_{\tilde{Z}}^2$ is lower bounded by $\gamma \|x^t - x^*\|^2$, since $\gamma$ is a lower bound for the eigenvalues of $\tilde{Z}$. Combine these two observations to obtain $\gamma \|x^t - x^*\|^2 \le \|u^t - u^*\|_G^2 + c\,p^t$. This inequality in conjunction with the expression in (41) yields the claim in (42). ■


Corollary 7 states that the sequence $\mathbb{E}[\|x^t - x^*\|^2]$ converges linearly to null. Note that the sequence $\mathbb{E}[\|x^t - x^*\|^2]$ is not necessarily monotonically decreasing as the sequence $\mathbb{E}[\|u^t - u^*\|_G^2 + c\,p^t]$ is. The result in (42) shows linear convergence of the sequence of variables generated by DSA in expectation. In the following theorem we show that all local variables $x_n^t$ generated by DSA almost surely converge to the optimal argument $\tilde{x}^*$.

Theorem 8 Consider the DSA algorithm as defined in (6)-(9) and assume the same hypotheses as in Theorem 6. Then, the sequences of local variables $x_n^t$ for all $n = 1, \ldots, N$ converge almost surely to the optimal argument $\tilde{x}^*$, i.e.,

$$\lim_{t \to \infty} x_n^t = \tilde{x}^* \quad \text{a.s.} \quad \text{for all } n = 1, \ldots, N. \tag{43}$$

Further, the almost sure convergence is at least of order $O(1/t)$.

Proof See Appendix E. ■


Theorem 8 provides almost sure convergence of $x^t$ to the optimal solution $x^*$, which is a stronger result than convergence in expectation as in Corollary 7; however, the rate of convergence for the almost sure convergence is sublinear, $O(1/t)$, which is slower relative to the linear convergence in expectation provided in (42).
3.3 Convergence constant

The constant $\delta$ that controls the speed of convergence can be simplified by selecting specific values for $\eta$, $\alpha$, and $c$. This uncovers connections to the properties of the local objective functions and the network topology. To make this clearer define the condition numbers of the objective function and the graph as

$$\kappa_f = \frac{L}{\mu}, \qquad \kappa_g = \frac{\max\{\Gamma, \Gamma'\}}{\min\{\gamma, \gamma'\}}, \tag{44}$$


respectively. The condition number of the function is a measure of how difficult it is to minimize the 
local functions using gradient descent directions. The condition number of the graph is a measure 
of how slow the graph is in propagating a diffusion process. Both are known to control the speed 
of convergence of distributed optimization methods. The following corollary illustrates that these 
condition numbers also determine the convergence speed of DSA. 
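To make the graph condition number in (44) concrete, the following sketch computes $\kappa_g$ for two small topologies. It assumes Laplacian-based weights $W = I - L/\tau$ with $\tau = (2/3)\lambda_{\max}(L)$ and $\tilde{W} = (I + W)/2$, and it works with $W$ and $\tilde{W}$ directly on the assumption that $Z$ and $\tilde{Z}$ share their eigenvalues. The helper names are hypothetical.

```python
import numpy as np

def laplacian_weights(adj, tau_factor=2.0 / 3.0):
    """W = I - Lap/tau with tau = tau_factor * lambda_max(Lap)."""
    lap = np.diag(adj.sum(axis=1)) - adj
    tau = tau_factor * np.linalg.eigvalsh(lap).max()
    return np.eye(adj.shape[0]) - lap / tau

def graph_condition_number(W, tol=1e-9):
    """kappa_g = max(Gamma, Gamma') / min(gamma, gamma') for W_tilde = (I + W)/2."""
    n = W.shape[0]
    W_tilde = 0.5 * (np.eye(n) + W)
    eig_tilde = np.linalg.eigvalsh(W_tilde)        # spectrum of W_tilde
    eig_diff = np.linalg.eigvalsh(W_tilde - W)     # spectrum of W_tilde - W
    gamma, Gamma = eig_tilde.min(), eig_tilde.max()
    gamma_p = eig_diff[eig_diff > tol].min()       # smallest non-zero eigenvalue
    Gamma_p = eig_diff.max()
    return max(Gamma, Gamma_p) / min(gamma, gamma_p)

def complete_graph(n):
    """Adjacency matrix of the complete graph on n nodes."""
    return np.ones((n, n)) - np.eye(n)

def cycle_graph(n):
    """Adjacency matrix of the cycle graph on n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

kappa_complete = graph_condition_number(laplacian_weights(complete_graph(10)))
kappa_cycle = graph_condition_number(laplacian_weights(cycle_graph(10)))
print(kappa_complete, kappa_cycle)
```

Under these assumptions the complete graph yields the smaller condition number, which matches the intuition that better connected graphs diffuse information faster.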


Corollary 9 Consider the DSA algorithm as defined in (6)-(9) and assume the same hypotheses as in Theorem 6. Choose the weight matrices $W$ and $\tilde{W}$ as $\tilde{W} = (I + W)/2$, assign the same number of instantaneous local functions $f_{n,i}$ to each node, i.e., $q_{\min} = q_{\max} = q$, and set the constants $\eta$, $\alpha$ and $c$ as





(45) 


The linear convergence constant $0 < \delta < 1$ in (40) reduces to


6 = min 


1 

idKg ’ 


1 1 
q[l + 4k/(1 + 7 / 7 ')] ’ 4 ( 7 / 7 ')^/ + 


(46) 


Proof The given values for $\eta$, $\alpha$, and $c$ satisfy the conditions in Theorem 6. Substitute these values into the expression for $\delta$ in (99). Simplify terms and utilize the condition number definitions in (44). The second term in the minimization in (99) becomes redundant because it is dominated by the first. ■


Observe that while the choices of $\eta$, $\alpha$, and $c$ in (45) satisfy all the required conditions of Theorem 6, they are not necessarily optimal for maximizing the linear convergence constant $\delta$. Nevertheless, the expression in (46) shows that the convergence speed of DSA decreases with increases in the graph condition number $\kappa_g$, the local functions condition number $\kappa_f$, and the number of functions assigned to each node $q$. For a cleaner expression observe that $\gamma$ and $\gamma'$ are the minimum eigenvalues of the weight matrix $\tilde{W}$ and the weight matrix difference $\tilde{W} - W$, respectively. They can therefore be chosen to be of similar order. For reference, say that we choose $\gamma = \gamma'$ so that the ratio $\gamma/\gamma' = 1$. In that case, the constant $\delta$ in (46) reduces to


S = min 


1 

16 Kg ’ 


1 

g(l + 8 k/)’ 


1 

4(Kf + Sn'^Kg) 


(47) 


The three terms in (47) establish separate regimes: problems where the graph condition number is large, problems where the number of functions at each node is large, and problems where the condition number of the local functions is large. In the first regime the first term in (47) dominates and establishes a dependence in terms of the square of the graph's condition number. In the second regime the middle term dominates and results in an inverse dependence on the number of functions available at each node. In the third regime, the third term dominates. The dependence in this case
is inversely proportional to Kj. 
4. Numerical analysis

We numerically study the performance of the DSA algorithm in solving a logistic regression problem. In this problem we are given $Q = \sum_{n=1}^{N} q_n$ training samples that we distribute across $N$ distinct nodes. Denote $q_n$ as the number of samples that are assigned to node $n$. The training points at node $n$ are denoted by $s_{ni} \in \mathbb{R}^p$ for $i = 1, \ldots, q_n$ with associated labels $l_{ni} \in \{-1, 1\}$. The goal is to predict the probability $P(l = 1 \mid s)$ of having label $l = 1$ for sample point $s$. The logistic regression model assumes that this probability can be computed as $P(l = 1 \mid s) = 1/(1 + \exp(-s^T \tilde{x}))$ given a linear classifier $\tilde{x}$ that is computed based on the training samples. It follows from this model that the regularized maximum log likelihood estimate of the classifier $\tilde{x}$ given the training samples $(s_{ni}, l_{ni})$ for $i = 1, \ldots, q_n$ and $n = 1, \ldots, N$ is the solution of problem

$$\tilde{x}^* := \operatorname*{argmin}_{\tilde{x} \in \mathbb{R}^p} \ \frac{\lambda}{2} \|\tilde{x}\|^2 + \sum_{n=1}^{N} \sum_{i=1}^{q_n} \log\left(1 + \exp(-l_{ni} s_{ni}^T \tilde{x})\right), \tag{48}$$

where the regularization term $(\lambda/2)\|\tilde{x}\|^2$ is added to reduce overfitting to the training set. The optimization problem in (48) can be written in the form of (1) by defining the local objective functions $f_n$ as

$$f_n(\tilde{x}) = \frac{\lambda}{2N} \|\tilde{x}\|^2 + \sum_{i=1}^{q_n} \log\left(1 + \exp(-l_{ni} s_{ni}^T \tilde{x})\right). \tag{49}$$

Observe that the local functions $f_n$ in (49) can be written as the average of a set of instantaneous functions $f_{n,i}$ defined as

$$f_{n,i}(\tilde{x}) = \frac{\lambda}{2N} \|\tilde{x}\|^2 + q_n \log\left(1 + \exp(-l_{ni} s_{ni}^T \tilde{x})\right), \tag{50}$$

for all $i = 1, \ldots, q_n$. Considering the definitions of the instantaneous local functions $f_{n,i}$ in (50) and the local functions $f_n$ in (49), problem (48) can be solved using the DSA algorithm.
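The relation between (49) and (50) can be illustrated with a few lines of code. This is a minimal sketch, assuming the $\lambda/(2N)$ regularizer split implied by requiring the local functions to sum to the objective in (48); the function names and the value of `lam` are hypothetical choices for illustration, not values from the experiments.

```python
import numpy as np

def f_instantaneous(x, s, l, lam, N, q_n):
    """f_{n,i}(x) as in (50) for one sample (s, l) at a node holding q_n samples."""
    return lam / (2 * N) * (x @ x) + q_n * np.logaddexp(0.0, -l * (s @ x))

def grad_instantaneous(x, s, l, lam, N, q_n):
    """Gradient of f_{n,i}: (lam/N) x - q_n * l * sigmoid(-l s^T x) * s."""
    z = -l * (s @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))       # sigmoid(z), derivative of log(1 + exp(z))
    return lam / N * x - q_n * l * sigma * s

def f_local(x, S, labels, lam, N):
    """f_n(x) as in (49): regularizer plus the sum of logistic losses at node n."""
    return lam / (2 * N) * (x @ x) + np.sum(np.logaddexp(0.0, -labels * (S @ x)))

rng = np.random.default_rng(1)
N, q_n, p, lam = 20, 25, 2, 0.1            # lam is an arbitrary illustrative value
S = rng.normal(size=(q_n, p))
labels = rng.choice([-1.0, 1.0], size=q_n)
x = rng.normal(size=p)

# The average of the instantaneous functions should reproduce the local function.
avg = np.mean([f_instantaneous(x, S[i], labels[i], lam, N, q_n) for i in range(q_n)])

# Central finite-difference check of the gradient for the first sample.
h = 1e-6
fd = np.array([
    (f_instantaneous(x + h * e, S[0], labels[0], lam, N, q_n)
     - f_instantaneous(x - h * e, S[0], labels[0], lam, N, q_n)) / (2 * h)
    for e in np.eye(p)
])
print(np.isclose(avg, f_local(x, S, labels, lam, N)))
```

Using `np.logaddexp(0, z)` evaluates $\log(1 + e^{z})$ without overflow for large $|z|$, which is why it is preferred over the literal formula here.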

In our experiments we use a synthetic dataset where components of the feature vectors $s_{ni}$ with label $l_{ni} = 1$ are generated from a normal distribution with mean $\mu$ and standard deviation $\sigma_+$, while sample points with label $l_{ni} = -1$ are generated from a normal distribution with mean $-\mu$ and standard deviation $\sigma_-$. We consider a network of size $N$ where the edges between nodes are generated randomly with probability $p_c$. The weight matrix $W$ is generated using the Laplacian matrix $L$ of the network as

$$W = I - L/\tau, \tag{51}$$

where $\tau > (1/2)\lambda_{\max}(L)$. We capture the error of each algorithm by the sum of squared differences of the local iterates $x_n^t$ from the optimal solution $\tilde{x}^*$ as

$$e^t = \|x^t - x^*\|^2 = \sum_{i=1}^{N} \|x_i^t - \tilde{x}^*\|^2. \tag{52}$$

We use the total number of sample points $Q = 500$, feature vectors dimension $p = 2$, regularization parameter λ = 10“'^, probability of existence of an edge $p_c = 0.3$, and $\tau = (2/3)\lambda_{\max}(L)$. To make the dataset not linearly separable we set the mean to $\mu = 2$ and the standard deviations to $\sigma_+ = \sigma_- = 2$. We use a centralized algorithm for computing the optimal argument $\tilde{x}^*$ in all of our experiments.
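The data model and the error metric in (52) can be sketched as follows. This is an illustration under assumptions that the text does not fix: labels are drawn with equal probability, samples are split evenly across nodes, and the names `make_dataset` and `consensus_error` are hypothetical.

```python
import numpy as np

def make_dataset(N, q, p, mean, sigma_pos, sigma_neg, rng):
    """Draw features componentwise from N(+mean, sigma_pos^2) or N(-mean, sigma_neg^2)."""
    labels = rng.choice([-1.0, 1.0], size=(N, q))     # assumed: both labels equally likely
    sigma = np.where(labels > 0, sigma_pos, sigma_neg)
    noise = rng.standard_normal((N, q, p))
    features = labels[..., None] * mean + sigma[..., None] * noise
    return features, labels

def consensus_error(X, x_opt):
    """e^t as in (52): sum over nodes of the squared distance to the optimum."""
    return float(np.sum((X - x_opt) ** 2))

rng = np.random.default_rng(0)
features, labels = make_dataset(N=20, q=25, p=2, mean=2.0, sigma_pos=2.0, sigma_neg=2.0, rng=rng)
print(features.shape, labels.shape)
```

With $N = 20$ and $q = 25$ the generator yields $Q = 500$ samples in total, the size quoted for the experiments.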

We provide a comparison of DSA with respect to DGD, EXTRA, stochastic EXTRA, and decentralized SAGA. Stochastic EXTRA is defined by using the stochastic gradient in ([5]) instead of the full gradient as in EXTRA or the stochastic averaging gradient as in DSA. Decentralized SAGA is a stochastic version of the DGD algorithm that uses the stochastic averaging gradient instead of the exact gradient, which is the naive approach for developing a decentralized version of the SAGA algorithm.
Figure 2: Convergence of DSA, EXTRA, DGD, Stochastic EXTRA, and Decentralized SAGA. Relative distance to optimality $e^t = \|x^t - x^*\|^2$ is shown with respect to the number of iterations $t$. DSA and EXTRA converge linearly to the optimal argument $x^*$, while DGD, Stochastic EXTRA, and Decentralized SAGA with constant step sizes converge to a neighborhood of the optimal solution. A smaller choice of stepsize leads to more accurate convergence for these algorithms. (Plot omitted; horizontal axis: number of iterations; curves: EXTRA, DSA, DGD, D-SAG, and sto-EXTRA for several stepsizes $\alpha$.)


In our experiments the weight matrix $\tilde{W}$ in EXTRA, stochastic EXTRA, and DSA is chosen as $\tilde{W} = (I + W)/2$. Fig. 2 illustrates the convergence paths of DSA, EXTRA, DGD, Stochastic EXTRA, and Decentralized SAGA with constant step sizes for $N = 20$ nodes. For EXTRA and DSA different stepsizes are tried and the best performances for EXTRA and DSA are achieved by α = 5 × 10“^ and α = 5 × 10“^, respectively. As shown in Fig. 2, DSA is the only stochastic algorithm that achieves linear convergence. Decentralized SAGA after a couple of iterations achieves the performance of DGD, and neither can achieve exact convergence. By choosing the smaller stepsize α = 10“^ they reach more accurate convergence relative to the stepsize α = 10“^; however, the speed of convergence is slower for the smaller stepsize. Stochastic EXTRA also suffers from inexact convergence, but for a different reason. DGD and decentralized SAGA have inexact convergence since they solve a penalty version of the original problem, while stochastic EXTRA cannot reach the optimal solution since the noise of the stochastic gradient is not vanishing. DSA resolves both issues by combining the idea of stochastic averaging from SAGA to control the noise of the stochastic gradient and using the double decentralized descent idea of stochastic EXTRA to solve the correct optimization problem. The convergence rate of EXTRA is faster than that of DSA in terms of number of iterations or, equivalently, number of communications; however, the complexity of each iteration for EXTRA is higher than for DSA. Therefore, it is reasonable to compare performances of these algorithms in terms of number of processed feature vectors. For instance, DSA requires 400 iterations or equivalently 400 feature vectors to achieve the error e^t = 10“^, while to achieve the same accuracy EXTRA requires 60 iterations, which is equivalent to processing 60 × 25 = 1500 feature vectors. These numbers show the advantage of DSA relative to EXTRA in requiring fewer processed feature vectors for achieving a specific accuracy.
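The combination of stochastic averaging and double descent described above can be sketched on a toy problem. The update equations of DSA lie outside this section, so the recursion below is an assumption: it uses the EXTRA form $x^{t+1} = (I + W)x^t - \tilde{W}x^{t-1} - \alpha(\hat{g}^t - \hat{g}^{t-1})$ with first step $x^1 = Wx^0 - \alpha\hat{g}^0$, and replaces gradients by SAGA-style stochastic averaging gradients. The objective is a hypothetical quadratic $f_{n,i}(x) = \tfrac{1}{2}\|x - a_{n,i}\|^2$ on a ring network, not the logistic regression problem of the experiments.

```python
import numpy as np

def ring_weights(n):
    """W = I - Lap/tau for a ring graph, with tau = (2/3) * lambda_max(Lap)."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    lap = np.diag(adj.sum(axis=1)) - adj
    tau = (2.0 / 3.0) * np.linalg.eigvalsh(lap).max()
    return np.eye(n) - lap / tau

def run_dsa(A, W, alpha, num_iters, seed=0):
    """Assumed DSA recursion on f_{n,i}(x) = 0.5*||x - A[n, i]||^2; returns e^t per iteration."""
    rng = np.random.default_rng(seed)
    N, q, p = A.shape
    W_tilde = 0.5 * (np.eye(N) + W)
    x_opt = A.reshape(-1, p).mean(axis=0)        # minimizer of sum_n f_n (equal q_n)
    x_prev = np.zeros((N, p))
    x = np.zeros((N, p))
    table = -A.copy()                            # stored gradients at y_{n,i}^0 = 0
    g_prev = np.zeros((N, p))
    rows = np.arange(N)
    errors = []
    for t in range(num_iters):
        idx = rng.integers(q, size=N)            # one uniformly drawn index per node
        grad_new = x - A[rows, idx]              # gradient of the drawn function at x_n^t
        g = grad_new - table[rows, idx] + table.mean(axis=1)   # stochastic averaging gradient
        table[rows, idx] = grad_new              # refresh the stored gradient
        if t == 0:
            x_next = W @ x - alpha * g
        else:
            x_next = x + W @ x - W_tilde @ x_prev - alpha * (g - g_prev)
        x_prev, x, g_prev = x, x_next, g
        errors.append(float(np.sum((x - x_opt) ** 2)))
    return np.array(errors)

A = np.random.default_rng(1).normal(size=(6, 5, 2))
errors = run_dsa(A, ring_weights(6), alpha=0.05, num_iters=3000)
print(errors[0], errors[-1])
```

Each iteration touches a single stored gradient per node, which is the source of the per-iteration cost advantage over a full-gradient method that the paragraph above quantifies in processed feature vectors.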

We study the performance of the DSA algorithm for different topologies. We keep the parameters in Fig. 2 except we change the size of the network to $N = 100$, which implies each node has $q_n = 5$ sample points. The linear convergence of the DSA algorithm for random networks with $p_c = 0.2$ and $p_c = 0.3$, complete graph, cycle, line and star are shown in Fig. 3. As we expect, for topologies where the graph is more connected and the diameter is smaller, linear convergence of DSA is faster. The best performance belongs to the complete graph, which requires 160 iterations to achieve the relative error e^t = 10“®. For random graphs with connectivity probabilities $p_c = 0.3$ and $p_c = 0.2$ DSA achieves the relative error e^t = 10“® after $t = 210$ and $t = 280$ iterations, respectively. For the cycle graph the number of required iterations for reaching the relative error e^t = 10“® is $t = 470$, while DSA does not reach this accuracy after $t = 1000$ iterations when the graph is a line or star.

Figure 3: Convergence of DSA for different network topologies. Relative distance to optimality $e^t = \|x^t - x^*\|^2$ is shown with respect to the number of iterations $t$. DSA has faster convergence in more connected networks.

5. Conclusions 

Decentralized double stochastic averaging gradient (DSA) is proposed as an algorithm for solving decentralized optimization problems where the local functions can be written as an average of a set of local instantaneous functions. DSA exploits stochastic averaging gradients in lieu of gradients and mixes information of two consecutive iterates to determine the descent direction. By assuming strongly convex local instantaneous functions with Lipschitz continuous gradients, the DSA algorithm converges linearly to the optimal argument in expectation. In addition, the sequence of local iterates $x_n^t$ for each node in the network almost surely converges to the optimal argument $\tilde{x}^*$. A comparison between the DSA algorithm and a group of stochastic and deterministic alternatives is provided for solving a logistic regression problem. The numerical results show DSA is the only stochastic decentralized algorithm to reach linear convergence. DSA outperforms decentralized stochastic alternatives in terms of the number of iterations required for convergence, and exhibits faster convergence relative to deterministic alternatives in terms of the number of feature vectors processed until convergence.
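Appendix A. Proof of Lemma 3

According to the definition of $\hat{g}^t$, which is the concatenation of the local stochastic averaging gradients $\hat{g}_n^t$, and the fact that the expected value of a sum is equal to the sum of expected values, we can write the expected value $\mathbb{E}[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t]$ as

$$\mathbb{E}\left[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t\right] = \sum_{n=1}^{N} \mathbb{E}\left[\|\hat{g}_n^t - \nabla f_n(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right]. \tag{53}$$

We proceed by finding upper bounds for the summands of (53). Observe that using the standard variance decomposition for any random vector $a$ we can write $\mathbb{E}[\|a\|^2] = \|\mathbb{E}[a]\|^2 + \mathbb{E}[\|a - \mathbb{E}[a]\|^2]$. Notice that the same relation holds true when the expectations are computed with respect to a specific field $\mathcal{F}$. By setting $a = \hat{g}_n^t - \nabla f_n(\tilde{x}^*)$ and considering the fact that $\mathbb{E}[a \mid \mathcal{F}^t] = \nabla f_n(x_n^t) - \nabla f_n(\tilde{x}^*)$, the variance decomposition implies

$$\mathbb{E}\left[\|\hat{g}_n^t - \nabla f_n(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] = \|\nabla f_n(x_n^t) - \nabla f_n(\tilde{x}^*)\|^2 + \mathbb{E}\left[\|\hat{g}_n^t - \nabla f_n(\tilde{x}^*) - \nabla f_n(x_n^t) + \nabla f_n(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right]. \tag{54}$$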


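The next step is to find an upper bound for the last term in (54). Adding and subtracting $\nabla f_{n,i_n^t}(\tilde{x}^*)$ and using the inequality $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ for $a = \nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) - \nabla f_n(x_n^t) + \nabla f_n(\tilde{x}^*)$ and $b = -\big(\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) - (1/q_n)\sum_{i=1}^{q_n} \nabla f_{n,i}(y_{n,i}^t) + \nabla f_n(\tilde{x}^*)\big)$ lead to

$$\begin{aligned}
\mathbb{E}\left[\|\hat{g}_n^t - \nabla f_n(\tilde{x}^*) - \nabla f_n(x_n^t) + \nabla f_n(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right]
\le{}& 2\,\mathbb{E}\left[\|\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) - \nabla f_n(x_n^t) + \nabla f_n(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] \\
& + 2\,\mathbb{E}\left[\Big\|\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) - \frac{1}{q_n}\sum_{i=1}^{q_n} \nabla f_{n,i}(y_{n,i}^t) + \nabla f_n(\tilde{x}^*)\Big\|^2 \mid \mathcal{F}^t\right].
\end{aligned} \tag{55}$$

In this step we use the standard variance decomposition twice to simplify the two expectations in the right hand side of (55). Notice that according to the standard variance decomposition $\mathbb{E}[\|a - \mathbb{E}[a]\|^2] = \mathbb{E}[\|a\|^2] - \|\mathbb{E}[a]\|^2$ we obtain $\mathbb{E}[\|a - \mathbb{E}[a]\|^2] \le \mathbb{E}[\|a\|^2]$. Therefore, by setting $a = \nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)$ and observing that the expected value $\mathbb{E}[\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) \mid \mathcal{F}^t]$ is equal to $(1/q_n)\sum_{i=1}^{q_n} \nabla f_{n,i}(y_{n,i}^t) - \nabla f_n(\tilde{x}^*)$ we obtain that

$$\mathbb{E}\left[\Big\|\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) - \frac{1}{q_n}\sum_{i=1}^{q_n} \nabla f_{n,i}(y_{n,i}^t) + \nabla f_n(\tilde{x}^*)\Big\|^2 \mid \mathcal{F}^t\right] \le \mathbb{E}\left[\|\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right]. \tag{56}$$

Moreover, by choosing $a = \nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)$ and noticing the relation for the expected value, which is $\mathbb{E}[\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) \mid \mathcal{F}^t] = \nabla f_n(x_n^t) - \nabla f_n(\tilde{x}^*)$, the equality $\mathbb{E}[\|a - \mathbb{E}[a]\|^2] = \mathbb{E}[\|a\|^2] - \|\mathbb{E}[a]\|^2$ yields

$$\mathbb{E}\left[\|\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*) - \nabla f_n(x_n^t) + \nabla f_n(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] = \mathbb{E}\left[\|\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] - \|\nabla f_n(x_n^t) - \nabla f_n(\tilde{x}^*)\|^2. \tag{57}$$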
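Substituting the upper bound in (56) and the simplification in (57) into (55), and considering the expressions in (53) and (54), lead to

$$\begin{aligned}
\mathbb{E}\left[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t\right] \le{}& 2\sum_{n=1}^{N} \mathbb{E}\left[\|\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] \\
& + 2\sum_{n=1}^{N} \mathbb{E}\left[\|\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] - \sum_{n=1}^{N} \|\nabla f_n(x_n^t) - \nabla f_n(\tilde{x}^*)\|^2.
\end{aligned} \tag{58}$$

We proceed by finding an upper bound for the first sum in the right hand side of (58). Notice that if the gradients of a convex function $g$ are Lipschitz continuous with parameter $L$, then for any two vectors $a_1$ and $a_2$ we can write $g(a_1) \ge g(a_2) + \nabla g(a_2)^T (a_1 - a_2) + (1/2L)\|\nabla g(a_1) - \nabla g(a_2)\|^2$. According to the Lipschitz continuity of the instantaneous local functions gradients $\nabla f_{n,i}(x_n)$, we can write the inequality for $g = f_{n,i}$, $a_1 = y_{n,i}^t$ and $a_2 = \tilde{x}^*$, which is equivalent to

$$\frac{1}{2L} \|\nabla f_{n,i}(y_{n,i}^t) - \nabla f_{n,i}(\tilde{x}^*)\|^2 \le f_{n,i}(y_{n,i}^t) - f_{n,i}(\tilde{x}^*) - \nabla f_{n,i}(\tilde{x}^*)^T (y_{n,i}^t - \tilde{x}^*). \tag{59}$$

Summing up both sides of (59) for all $i = 1, \ldots, q_n$ and dividing both sides of the implied inequality by $q_n$ lead to

$$\frac{1}{q_n} \sum_{i=1}^{q_n} \|\nabla f_{n,i}(y_{n,i}^t) - \nabla f_{n,i}(\tilde{x}^*)\|^2 \le 2L\, \frac{1}{q_n} \sum_{i=1}^{q_n} \left( f_{n,i}(y_{n,i}^t) - f_{n,i}(\tilde{x}^*) - \nabla f_{n,i}(\tilde{x}^*)^T (y_{n,i}^t - \tilde{x}^*) \right). \tag{60}$$

Since the random function $f_{n,i_n^t}$ has a uniform distribution over the set $\{f_{n,1}, \ldots, f_{n,q_n}\}$, we can substitute the left hand side of (60) by $\mathbb{E}[\|\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t]$. Apply this substitution and sum up both sides of (60) for $n = 1, \ldots, N$. According to the definition of the sequence $p^t$ in (33), if we sum up the right hand side of (60) over $n$ it can be simplified as $2Lp^t$. Applying these simplifications we obtain

$$\sum_{n=1}^{N} \mathbb{E}\left[\|\nabla f_{n,i_n^t}(y_{n,i_n^t}^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] \le 2Lp^t. \tag{61}$$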


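Substituting the upper bound in (61) into (58) and simplifying the sum $\sum_{n=1}^{N} \|\nabla f_n(x_n^t) - \nabla f_n(\tilde{x}^*)\|^2$ as $\|\nabla f(x^t) - \nabla f(x^*)\|^2$ yield

$$\mathbb{E}\left[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t\right] \le 2\sum_{n=1}^{N} \mathbb{E}\left[\|\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] - \|\nabla f(x^t) - \nabla f(x^*)\|^2 + 4Lp^t. \tag{62}$$

To show that the sum in the right hand side of (62) is bounded above we use the Lipschitz continuity of the instantaneous functions gradients $\nabla f_{n,i}$. Using the same argument from (59) to (61) we can write

$$\sum_{n=1}^{N} \mathbb{E}\left[\|\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] \le 2L \sum_{n=1}^{N} \frac{1}{q_n} \sum_{i=1}^{q_n} \left( f_{n,i}(x_n^t) - f_{n,i}(\tilde{x}^*) - \nabla f_{n,i}(\tilde{x}^*)^T (x_n^t - \tilde{x}^*) \right). \tag{63}$$

Considering the definition of the local objective functions $f_n(x_n) = (1/q_n)\sum_{i=1}^{q_n} f_{n,i}(x_n)$ and the aggregate function $f(x) := \sum_{n=1}^{N} f_n(x_n)$, the right hand side of (63) can be simplified as

$$\sum_{n=1}^{N} \mathbb{E}\left[\|\nabla f_{n,i_n^t}(x_n^t) - \nabla f_{n,i_n^t}(\tilde{x}^*)\|^2 \mid \mathcal{F}^t\right] \le 2L \left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right). \tag{64}$$

Replacing the sum in (62) by the upper bound in (64) implies

$$\mathbb{E}\left[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t\right] \le 4Lp^t - \|\nabla f(x^t) - \nabla f(x^*)\|^2 + 4L \left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right). \tag{65}$$

Considering the strong convexity of function $f$ with constant $\mu$ we can write

$$\|\nabla f(x^t) - \nabla f(x^*)\|^2 \ge 2\mu \left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right). \tag{66}$$

Therefore, we can substitute $\|\nabla f(x^t) - \nabla f(x^*)\|^2$ in (65) by the lower bound in (66) and the claim in (34) follows.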


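Appendix B. Proof of Lemma 4

According to the Lipschitz continuity of the aggregate function gradients $\nabla f(x)$ we can write $(1/L)\|\nabla f(x^t) - \nabla f(x^*)\|^2 \le (x^t - x^*)^T (\nabla f(x^t) - \nabla f(x^*))$. By adding and subtracting $x^{t+1}$ to the term $x^t - x^*$ and multiplying both sides of the inequality by $2\alpha$ we obtain

$$\frac{2\alpha}{L} \|\nabla f(x^t) - \nabla f(x^*)\|^2 \le 2\alpha (x^{t+1} - x^*)^T (\nabla f(x^t) - \nabla f(x^*)) + 2\alpha (x^t - x^{t+1})^T (\nabla f(x^t) - \nabla f(x^*)). \tag{67}$$

Expanding the difference $\nabla f(x^t) - \nabla f(x^*)$ as $\hat{g}^t - \nabla f(x^*) + \nabla f(x^t) - \hat{g}^t$ for the first inner product in the right hand side of (67) implies

$$\begin{aligned}
\frac{2\alpha}{L} \|\nabla f(x^t) - \nabla f(x^*)\|^2 \le{}& 2\alpha (x^t - x^{t+1})^T (\nabla f(x^t) - \nabla f(x^*)) + 2\alpha (x^{t+1} - x^*)^T (\hat{g}^t - \nabla f(x^*)) \\
& + 2\alpha (x^{t+1} - x^*)^T (\nabla f(x^t) - \hat{g}^t).
\end{aligned} \tag{68}$$

We proceed to simplify the inner product $2\alpha (x^{t+1} - x^*)^T (\hat{g}^t - \nabla f(x^*))$ in the right hand side of (68) by substituting $\alpha(\hat{g}^t - \nabla f(x^*))$ with its equivalent as introduced in (30). Applying this substitution the inner product can be simplified as

$$\begin{aligned}
2\alpha (x^{t+1} - x^*)^T (\hat{g}^t - \nabla f(x^*)) ={}& -2\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} + 2(x^{t+1} - x^*)^T \tilde{Z} (x^t - x^{t+1}) \\
& - 2(x^{t+1} - x^*)^T U (v^{t+1} - v^*).
\end{aligned} \tag{69}$$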

First notice that according to the KKT condition of problem (1^51) the optimal primal variable satisfies $(\tilde{Z} - Z)^{1/2} x^* = 0$, which by considering the definition of matrix $U = (\tilde{Z} - Z)^{1/2}$ implies that $U x^* = 0$. This observation in association with the update rule of the dual variable $v^t$ in (1^51) implies that we can substitute $U(x^{t+1} - x^*)$ by $v^{t+1} - v^t$. Making this substitution into the last summand of the right hand side of (69) and considering the symmetry of matrix $U$ yield

$$\begin{aligned}
2\alpha (x^{t+1} - x^*)^T (\hat{g}^t - \nabla f(x^*)) ={}& -2\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} + 2(x^{t+1} - x^*)^T \tilde{Z} (x^t - x^{t+1}) \\
& - 2(v^{t+1} - v^t)^T (v^{t+1} - v^*).
\end{aligned} \tag{70}$$

According to the definition of vector $u$ and matrix $G$ in (35), the last two summands of (70) can be simplified as $2(u^{t+1} - u^*)^T G (u^t - u^{t+1})$. Moreover, observe that the inner product $2(u^{t+1} - u^*)^T G (u^t - u^{t+1})$ can be simplified as $\|u^t - u^*\|_G^2 - \|u^{t+1} - u^*\|_G^2 - \|u^{t+1} - u^t\|_G^2$. Applying this simplification into (70) implies

$$2\alpha (x^{t+1} - x^*)^T (\hat{g}^t - \nabla f(x^*)) = -2\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} + \|u^t - u^*\|_G^2 - \|u^{t+1} - u^*\|_G^2 - \|u^{t+1} - u^t\|_G^2. \tag{71}$$


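The next step is to bound above the inner product $2\alpha (x^t - x^{t+1})^T (\nabla f(x^t) - \nabla f(x^*))$. Note that for any two vectors $a$ and $b$, and any positive scalar $\eta$, the inequality $2a^T b \le \eta \|a\|^2 + \eta^{-1} \|b\|^2$ holds true. Therefore, by setting $a = x^t - x^{t+1}$ and $b = \nabla f(x^t) - \nabla f(x^*)$ we obtain that

$$2\alpha (x^t - x^{t+1})^T (\nabla f(x^t) - \nabla f(x^*)) \le \frac{\alpha}{\eta} \|\nabla f(x^t) - \nabla f(x^*)\|^2 + \alpha\eta \|x^t - x^{t+1}\|^2. \tag{72}$$

Now we substitute the terms in the right hand side of (68) by their simplifications or upper bounds. Replacing the inner product $2\alpha (x^{t+1} - x^*)^T (\hat{g}^t - \nabla f(x^*))$ by the simplification in (71), substituting the expression $2\alpha (x^t - x^{t+1})^T (\nabla f(x^t) - \nabla f(x^*))$ by the upper bound in (72), and substituting the inner product $2\alpha (x^{t+1} - x^*)^T (\nabla f(x^t) - \hat{g}^t)$ by the sum $2\alpha (x^t - x^*)^T (\nabla f(x^t) - \hat{g}^t) + 2\alpha (x^{t+1} - x^t)^T (\nabla f(x^t) - \hat{g}^t)$ imply

$$\begin{aligned}
\frac{2\alpha}{L} \|\nabla f(x^t) - \nabla f(x^*)\|^2 \le{}& -2\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} + \|u^t - u^*\|_G^2 - \|u^{t+1} - u^*\|_G^2 \\
& - \|u^{t+1} - u^t\|_G^2 + \alpha\eta \|x^t - x^{t+1}\|^2 + \frac{\alpha}{\eta} \|\nabla f(x^t) - \nabla f(x^*)\|^2 \\
& + 2\alpha (x^t - x^*)^T (\nabla f(x^t) - \hat{g}^t) + 2\alpha (x^{t+1} - x^t)^T (\nabla f(x^t) - \hat{g}^t).
\end{aligned} \tag{73}$$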


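Considering that $x^t - x^*$ is deterministic given the observations until step $t$ and observing the relation $\mathbb{E}[\hat{g}^t \mid \mathcal{F}^t] = \nabla f(x^t)$, we obtain that $\mathbb{E}[(x^t - x^*)^T (\nabla f(x^t) - \hat{g}^t) \mid \mathcal{F}^t] = 0$. Therefore, by computing the expected value of both sides of (73) given the observations until step $t$ and regrouping the terms we obtain

$$\begin{aligned}
\|u^t - u^*\|_G^2 - \mathbb{E}\left[\|u^{t+1} - u^*\|_G^2 \mid \mathcal{F}^t\right] \ge{}& \alpha \left( \frac{2}{L} - \frac{1}{\eta} \right) \|\nabla f(x^t) - \nabla f(x^*)\|^2 + \mathbb{E}\left[\|u^{t+1} - u^t\|_G^2 \mid \mathcal{F}^t\right] \\
& + 2\,\mathbb{E}\left[\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} \mid \mathcal{F}^t\right] - \alpha\eta\, \mathbb{E}\left[\|x^t - x^{t+1}\|^2 \mid \mathcal{F}^t\right] \\
& - \mathbb{E}\left[2\alpha (x^{t+1} - x^t)^T (\nabla f(x^t) - \hat{g}^t) \mid \mathcal{F}^t\right].
\end{aligned} \tag{74}$$

By applying the inequality $2a^T b \le \eta \|a\|^2 + \eta^{-1} \|b\|^2$ for the choice of vectors $a = x^{t+1} - x^t$ and $b = \nabla f(x^t) - \hat{g}^t$, we obtain that $2(x^{t+1} - x^t)^T (\nabla f(x^t) - \hat{g}^t)$ is bounded above by $\eta \|x^{t+1} - x^t\|^2 + (1/\eta) \|\nabla f(x^t) - \hat{g}^t\|^2$. Replacing $2(x^{t+1} - x^t)^T (\nabla f(x^t) - \hat{g}^t)$ in (74) by its upper bound $\eta \|x^{t+1} - x^t\|^2 + (1/\eta) \|\nabla f(x^t) - \hat{g}^t\|^2$ yields

$$\begin{aligned}
\|u^t - u^*\|_G^2 - \mathbb{E}\left[\|u^{t+1} - u^*\|_G^2 \mid \mathcal{F}^t\right] \ge{}& \alpha \left( \frac{2}{L} - \frac{1}{\eta} \right) \|\nabla f(x^t) - \nabla f(x^*)\|^2 + \mathbb{E}\left[\|u^{t+1} - u^t\|_G^2 \mid \mathcal{F}^t\right] \\
& + 2\,\mathbb{E}\left[\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} \mid \mathcal{F}^t\right] - \frac{\alpha}{\eta}\, \mathbb{E}\left[\|\nabla f(x^t) - \hat{g}^t\|^2 \mid \mathcal{F}^t\right] \\
& - 2\alpha\eta\, \mathbb{E}\left[\|x^t - x^{t+1}\|^2 \mid \mathcal{F}^t\right].
\end{aligned} \tag{75}$$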


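According to the definitions of vector $u$ and matrix $G$ in (35) the squared norm $\|u^{t+1} - u^t\|_G^2$ can be expanded as $\|x^{t+1} - x^t\|_{\tilde{Z}}^2 + \|v^{t+1} - v^t\|^2$. Making this simplification for $\|u^{t+1} - u^t\|_G^2$ and regrouping the terms in (75) lead to

$$\begin{aligned}
\|u^t - u^*\|_G^2 - \mathbb{E}\left[\|u^{t+1} - u^*\|_G^2 \mid \mathcal{F}^t\right] \ge{}& \alpha \left( \frac{2}{L} - \frac{1}{\eta} \right) \|\nabla f(x^t) - \nabla f(x^*)\|^2 + \mathbb{E}\left[\|x^{t+1} - x^t\|^2_{\tilde{Z} - 2\alpha\eta I} \mid \mathcal{F}^t\right] \\
& + 2\,\mathbb{E}\left[\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} \mid \mathcal{F}^t\right] + \mathbb{E}\left[\|v^{t+1} - v^t\|^2 \mid \mathcal{F}^t\right] \\
& - \frac{\alpha}{\eta}\, \mathbb{E}\left[\|\nabla f(x^t) - \hat{g}^t\|^2 \mid \mathcal{F}^t\right].
\end{aligned} \tag{76}$$

We proceed by simplifying $\mathbb{E}[\|\nabla f(x^t) - \hat{g}^t\|^2 \mid \mathcal{F}^t]$ in (76). Note that by adding and subtracting $\nabla f(x^*)$ the expectation can be written as $\mathbb{E}[\|\nabla f(x^t) - \nabla f(x^*) + \nabla f(x^*) - \hat{g}^t\|^2 \mid \mathcal{F}^t]$ and by expanding the squared norm and simplifying the terms we obtain

$$\mathbb{E}\left[\|\nabla f(x^t) - \hat{g}^t\|^2 \mid \mathcal{F}^t\right] = \mathbb{E}\left[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t\right] - \|\nabla f(x^t) - \nabla f(x^*)\|^2. \tag{77}$$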


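Substituting the simplification in (77) into (76) yields

$$\begin{aligned}
\|u^t - u^*\|_G^2 - \mathbb{E}\left[\|u^{t+1} - u^*\|_G^2 \mid \mathcal{F}^t\right] \ge{}& \frac{2\alpha}{L} \|\nabla f(x^t) - \nabla f(x^*)\|^2 + \mathbb{E}\left[\|x^{t+1} - x^t\|^2_{\tilde{Z} - 2\alpha\eta I} \mid \mathcal{F}^t\right] \\
& + 2\,\mathbb{E}\left[\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} \mid \mathcal{F}^t\right] + \mathbb{E}\left[\|v^{t+1} - v^t\|^2 \mid \mathcal{F}^t\right] \\
& - \frac{\alpha}{\eta}\, \mathbb{E}\left[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t\right].
\end{aligned} \tag{78}$$

Considering the strong convexity of function $f$ with constant $\mu$ we can write $\|\nabla f(x^t) - \nabla f(x^*)\|^2 \ge 2\mu \left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right)$. Substituting the squared norm $\|\nabla f(x^t) - \nabla f(x^*)\|^2$ by this lower bound in (78) yields

$$\begin{aligned}
\|u^t - u^*\|_G^2 - \mathbb{E}\left[\|u^{t+1} - u^*\|_G^2 \mid \mathcal{F}^t\right] \ge{}& \frac{4\alpha\mu}{L} \left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right) + \mathbb{E}\left[\|x^{t+1} - x^t\|^2_{\tilde{Z} - 2\alpha\eta I} \mid \mathcal{F}^t\right] \\
& + 2\,\mathbb{E}\left[\|x^{t+1} - x^*\|^2_{I + Z - 2\tilde{Z}} \mid \mathcal{F}^t\right] + \mathbb{E}\left[\|v^{t+1} - v^t\|^2 \mid \mathcal{F}^t\right] \\
& - \frac{\alpha}{\eta}\, \mathbb{E}\left[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t\right].
\end{aligned} \tag{79}$$

Substituting the upper bound for the expectation $\mathbb{E}[\|\hat{g}^t - \nabla f(x^*)\|^2 \mid \mathcal{F}^t]$ in (34) into (79) and regrouping the terms show validity of the claim in (36). ■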


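Appendix C. Proof of Lemma 5

Given the information until time $t$, each auxiliary vector $y_{n,i}^{t+1}$ is a random variable that takes values $y_{n,i}^t$ and $x_n^t$ with associated probabilities $1 - 1/q_n$ and $1/q_n$, respectively. This observation holds since with probability $1/q_n$ node $n$ may choose index $i$ to update at time $t + 1$ and with probability $1 - (1/q_n)$ choose other indices. Therefore, we can write

$$\mathbb{E}\left[\frac{1}{q_n} \sum_{i=1}^{q_n} \nabla f_{n,i}(\tilde{x}^*)^T (y_{n,i}^{t+1} - \tilde{x}^*) \mid \mathcal{F}^t\right] = \left[1 - \frac{1}{q_n}\right] \frac{1}{q_n} \sum_{i=1}^{q_n} \nabla f_{n,i}(\tilde{x}^*)^T (y_{n,i}^t - \tilde{x}^*) + \frac{1}{q_n} \nabla f_n(\tilde{x}^*)^T (x_n^t - \tilde{x}^*). \tag{80}$$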


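Likewise, the distribution of the random function $f_{n,i}(y_{n,i}^{t+1})$ given the observations until time $t$ has two possibilities, $f_{n,i}(y_{n,i}^t)$ and $f_{n,i}(x_n^t)$, with associated probabilities $1 - 1/q_n$ and $1/q_n$, respectively. Hence, we can write $\mathbb{E}[f_{n,i}(y_{n,i}^{t+1}) \mid \mathcal{F}^t] = (1 - 1/q_n) f_{n,i}(y_{n,i}^t) + (1/q_n) f_{n,i}(x_n^t)$. By summing this relation for all $i \in \{1, \ldots, q_n\}$ and dividing by $q_n$ we obtain

$$\mathbb{E}\left[\frac{1}{q_n} \sum_{i=1}^{q_n} f_{n,i}(y_{n,i}^{t+1}) \mid \mathcal{F}^t\right] = \left[1 - \frac{1}{q_n}\right] \frac{1}{q_n} \sum_{i=1}^{q_n} f_{n,i}(y_{n,i}^t) + \frac{1}{q_n} f_n(x_n^t). \tag{81}$$

For simplicity of the equations let us define the sequence $p_n^t$ as

$$p_n^t := \frac{1}{q_n} \sum_{i=1}^{q_n} f_{n,i}(y_{n,i}^t) - f_n(\tilde{x}^*) - \frac{1}{q_n} \sum_{i=1}^{q_n} \nabla f_{n,i}(\tilde{x}^*)^T (y_{n,i}^t - \tilde{x}^*). \tag{82}$$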


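Subtracting (80) from (81) and adding $-f_n(\tilde{x}^*)$ to both sides of the equality, in association with the definition of the sequence $p_n^t$ in (82), yield

$$\mathbb{E}\left[p_n^{t+1} \mid \mathcal{F}^t\right] = \left[1 - \frac{1}{q_n}\right] p_n^t + \frac{1}{q_n} \left[ f_n(x_n^t) - f_n(\tilde{x}^*) - \nabla f_n(\tilde{x}^*)^T (x_n^t - \tilde{x}^*) \right]. \tag{83}$$

We proceed to find an upper bound for the terms in the right hand side of (83). First note that according to the strong convexity of the instantaneous functions $f_{n,i}$ and the local functions $f_n$, both terms in the right hand side of (83) are non-negative. Observing that the number of instantaneous functions at each node $q_n$ satisfies the condition $q_{\min} \le q_n \le q_{\max}$, we obtain

$$1 - \frac{1}{q_n} \le 1 - \frac{1}{q_{\max}}, \qquad \frac{1}{q_n} \le \frac{1}{q_{\min}}. \tag{84}$$

Substituting the upper bounds in (84) into (83), summing both sides of the implied inequality over $n \in \{1, \ldots, N\}$, and considering the definitions of the optimal argument $x^* = [\tilde{x}^*; \ldots; \tilde{x}^*]$ and the aggregate function $f(x) = \sum_{n=1}^{N} f_n(x_n)$ lead to

$$\sum_{n=1}^{N} \mathbb{E}\left[p_n^{t+1} \mid \mathcal{F}^t\right] \le \left[1 - \frac{1}{q_{\max}}\right] \sum_{n=1}^{N} p_n^t + \frac{1}{q_{\min}} \left[ f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right]. \tag{85}$$

Now observe that according to the definitions of the sequences $p^t$ and $p_n^t$ in (33) and (82), respectively, $p^t$ is the sum of $p_n^t$ for all $n$, i.e., $p^t = \sum_{n=1}^{N} p_n^t$. Therefore, we can rewrite (85) as

$$\mathbb{E}\left[p^{t+1} \mid \mathcal{F}^t\right] \le \left[1 - \frac{1}{q_{\max}}\right] p^t + \frac{1}{q_{\min}} \left[ f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right]. \tag{86}$$

Therefore, the claim in (37) is valid.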


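Appendix D. Proof of Theorem 6

To prove the result of Theorem 6 we first prove the following lemma to establish an upper bound for $\|v^t - v^*\|^2$.

Lemma 10 Consider the DSA algorithm as defined in (6)-(9). Further, recall $\gamma'$ as the smallest non-zero eigenvalue and $\Gamma'$ as the largest eigenvalue of matrix $\tilde{Z} - Z$. If Assumptions 1-3 hold true, then the squared norm of the difference $\|v^t - v^*\|^2$ is bounded above as

$$\begin{aligned}
\|v^t - v^*\|^2 \le{}& \frac{8}{\gamma'}\, \mathbb{E}\left[\|x^{t+1} - x^*\|^2_{(I + Z - 2\tilde{Z})^2} \mid \mathcal{F}^t\right] + \frac{8}{\gamma'}\, \mathbb{E}\left[\|x^t - x^{t+1}\|^2_{\tilde{Z}^2} \mid \mathcal{F}^t\right] + \frac{2\Gamma'}{\gamma'}\, \mathbb{E}\left[\|v^t - v^{t+1}\|^2 \mid \mathcal{F}^t\right] \\
& + \frac{16\alpha^2 L}{\gamma'}\, p^t + \frac{8\alpha^2 (2L - \mu)}{\gamma'} \left[ f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right].
\end{aligned} \tag{87}$$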
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Proof Consider the basic inequality ||a + b|p < 2||a|p + 2||b|p for the case that a = U(v*+^ ~ v*), 
b = U(v* — v*+^) which can be written as 

||U(v* - v*)f < 2||U(v‘+i - v*)f + 2||U(v* - v*+i)f. (88) 
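For completeness, the basic inequality used here (and twice more below) follows in one line from expanding the square and applying $2\mathbf{a}^T\mathbf{b} \le \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2$. The short derivation below is a standard argument added for the reader and is not specific to DSA.

```latex
\begin{align*}
\|\mathbf{a}+\mathbf{b}\|^2
  &= \|\mathbf{a}\|^2 + 2\,\mathbf{a}^T\mathbf{b} + \|\mathbf{b}\|^2 \\
  &\le \|\mathbf{a}\|^2 + \left(\|\mathbf{a}\|^2 + \|\mathbf{b}\|^2\right) + \|\mathbf{b}\|^2
   = 2\|\mathbf{a}\|^2 + 2\|\mathbf{b}\|^2,
\end{align*}
% where the inequality uses 0 \le \|\mathbf{a}-\mathbf{b}\|^2
%   = \|\mathbf{a}\|^2 - 2\,\mathbf{a}^T\mathbf{b} + \|\mathbf{b}\|^2 .
```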


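We proceed by finding an upper bound for $2\|\mathbf{U}(\mathbf{v}^{t+1} - \mathbf{v}^*)\|^2$. Based on the result of Lemma 5 in (30), the term $\mathbf{U}(\mathbf{v}^{t+1} - \mathbf{v}^*)$ is equal to the sum of vectors $\mathbf{a} + \mathbf{b}$ where $\mathbf{a} = (\mathbf{I} + \mathbf{Z} - 2\tilde{\mathbf{Z}})(\mathbf{x}^{t+1} - \mathbf{x}^*) - \tilde{\mathbf{Z}}(\mathbf{x}^t - \mathbf{x}^{t+1})$ and $\mathbf{b} = -\alpha(\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*))$. Therefore, using the inequality $\|\mathbf{a} + \mathbf{b}\|^2 \le 2\|\mathbf{a}\|^2 + 2\|\mathbf{b}\|^2$ we can write

$$\|\mathbf{U}(\mathbf{v}^{t+1} - \mathbf{v}^*)\|^2 \le 2\left\|(\mathbf{I} + \mathbf{Z} - 2\tilde{\mathbf{Z}})(\mathbf{x}^{t+1} - \mathbf{x}^*) - \tilde{\mathbf{Z}}(\mathbf{x}^t - \mathbf{x}^{t+1})\right\|^2 + 2\alpha^2\left\|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\right\|^2. \tag{89}$$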


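By using the inequality $\|\mathbf{a} + \mathbf{b}\|^2 \le 2\|\mathbf{a}\|^2 + 2\|\mathbf{b}\|^2$ one more time for the vectors $\mathbf{a} = (\mathbf{I} + \mathbf{Z} - 2\tilde{\mathbf{Z}})(\mathbf{x}^{t+1} - \mathbf{x}^*)$ and $\mathbf{b} = -\tilde{\mathbf{Z}}(\mathbf{x}^t - \mathbf{x}^{t+1})$, we obtain an upper bound for the term $\|(\mathbf{I} + \mathbf{Z} - 2\tilde{\mathbf{Z}})(\mathbf{x}^{t+1} - \mathbf{x}^*) - \tilde{\mathbf{Z}}(\mathbf{x}^t - \mathbf{x}^{t+1})\|^2$. Substituting this upper bound into (89) and using the definition of the weighted norm lead to

$$\|\mathbf{U}(\mathbf{v}^{t+1} - \mathbf{v}^*)\|^2 \le 4\|\mathbf{x}^{t+1} - \mathbf{x}^*\|^2_{(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})^2} + 4\|\mathbf{x}^t - \mathbf{x}^{t+1}\|^2_{\tilde{\mathbf{Z}}^2} + 2\alpha^2\left\|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\right\|^2. \tag{90}$$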


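Multiplying both sides of inequality (90) by 2 gives an upper bound for the term $2\|\mathbf{U}(\mathbf{v}^{t+1} - \mathbf{v}^*)\|^2$ in (88). Moreover, we know that the second term $\|\mathbf{U}(\mathbf{v}^t - \mathbf{v}^{t+1})\|^2$ is bounded above by $\Gamma'\|\mathbf{v}^t - \mathbf{v}^{t+1}\|^2$, where $\Gamma'$ is the largest eigenvalue of the matrix $\tilde{\mathbf{Z}} - \mathbf{Z} = \mathbf{U}^2$. Substituting these upper bounds into (88) and computing the expected value of both sides given the information until step $t$ yield

$$\begin{aligned}
\|\mathbf{U}(\mathbf{v}^t - \mathbf{v}^*)\|^2 \le{}& 8\,\mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^*\|^2_{(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})^2} \mid \mathcal{F}^t\right] + 8\,\mathbb{E}\left[\|\mathbf{x}^t - \mathbf{x}^{t+1}\|^2_{\tilde{\mathbf{Z}}^2} \mid \mathcal{F}^t\right] \\
&+ 4\alpha^2\,\mathbb{E}\left[\|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\|^2 \mid \mathcal{F}^t\right] + 2\Gamma'\,\mathbb{E}\left[\|\mathbf{v}^t - \mathbf{v}^{t+1}\|^2 \mid \mathcal{F}^t\right].
\end{aligned} \tag{91}$$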


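Note that, since both $\mathbf{v}^t$ and $\mathbf{v}^*$ lie in the column space of the matrix $\mathbf{U}$, we obtain $\|\mathbf{U}(\mathbf{v}^t - \mathbf{v}^*)\|^2 \ge \gamma'\|\mathbf{v}^t - \mathbf{v}^*\|^2$. Substituting this lower bound for $\|\mathbf{U}(\mathbf{v}^t - \mathbf{v}^*)\|^2$ in (91) and dividing both sides of the resulting inequality by $\gamma'$ yield

$$\begin{aligned}
\|\mathbf{v}^t - \mathbf{v}^*\|^2 \le{}& \frac{8}{\gamma'}\,\mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^*\|^2_{(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})^2} \mid \mathcal{F}^t\right] + \frac{8}{\gamma'}\,\mathbb{E}\left[\|\mathbf{x}^t - \mathbf{x}^{t+1}\|^2_{\tilde{\mathbf{Z}}^2} \mid \mathcal{F}^t\right] \\
&+ \frac{4\alpha^2}{\gamma'}\,\mathbb{E}\left[\|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\|^2 \mid \mathcal{F}^t\right] + \frac{2\Gamma'}{\gamma'}\,\mathbb{E}\left[\|\mathbf{v}^t - \mathbf{v}^{t+1}\|^2 \mid \mathcal{F}^t\right].
\end{aligned} \tag{92}$$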
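The role of the column space in the bound $\gamma'\|\mathbf{w}\|^2 \le \|\mathbf{U}\mathbf{w}\|^2 \le \Gamma'\|\mathbf{w}\|^2$ can be illustrated on a small example. The sketch below assumes a hypothetical 3-node path graph with scalar variables, the weight choice $\mathbf{Z} = \mathbf{I} - w\,\mathbf{L}_{\text{path}}$ and $\tilde{\mathbf{Z}} = (\mathbf{I} + \mathbf{Z})/2$; none of these choices comes from the paper. It builds $\mathbf{U} = (\tilde{\mathbf{Z}} - \mathbf{Z})^{1/2}$ from the known Laplacian eigenvectors.

```python
import math
import random

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def sqnorm(x):
    return sum(xi * xi for xi in x)

w = 0.2  # hypothetical edge weight
# Path-graph Laplacian on 3 nodes has eigenvalues 0, 1, 3; e1 and e3 are the
# unit eigenvectors for eigenvalues 1 and 3 (the eigenvalue 0 belongs to the ones vector).
e1 = [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)]
e3 = [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)]
# With Z = I - w*Lap and Z_tilde = (I + Z)/2 (assumed), Z_tilde - Z = (w/2)*Lap.
gamma_p = 0.5 * w * 1.0   # smallest non-zero eigenvalue of Z_tilde - Z
Gamma_p = 0.5 * w * 3.0   # largest eigenvalue of Z_tilde - Z
# U is the symmetric square root of Z_tilde - Z.
U = [[math.sqrt(gamma_p) * a + math.sqrt(Gamma_p) * b for a, b in zip(ra, rb)]
     for ra, rb in zip(outer(e1, e1), outer(e3, e3))]

def bounds_hold(v, tol=1e-12):
    lhs = sqnorm(matvec(U, v))
    return gamma_p * sqnorm(v) - tol <= lhs <= Gamma_p * sqnorm(v) + tol

random.seed(0)
# Vectors of the form U z lie in the column space of U by construction.
in_span = [matvec(U, [random.gauss(0, 1) for _ in range(3)]) for _ in range(1000)]
all_ok = all(bounds_hold(v) for v in in_span)
# The ones vector is orthogonal to the column space, so the lower bound fails there.
ones = [1.0, 1.0, 1.0]
lower_fails_outside = sqnorm(matvec(U, ones)) < gamma_p * sqnorm(ones)
```

The second check shows why the argument needs $\mathbf{v}^t - \mathbf{v}^*$ to lie in the column space: for a vector in the null space of $\mathbf{U}$ the lower bound with the smallest non-zero eigenvalue does not hold.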


Substituting $\mathbb{E}\left[\|\hat{\mathbf{g}}^t - \nabla f(\mathbf{x}^*)\|^2 \mid \mathcal{F}^t\right]$ in the right hand side of (92) by its upper bound in (34) yields the claim in (87). ■


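Using the result in Lemma 10 we show linear convergence of the sequence $\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + c\,p^t$ as follows.

Proof of Theorem 6 Proving the linear convergence claim in (40) is equivalent to showing that

$$\delta\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + \delta c\,p^t \le \|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} - \mathbb{E}\left[\|\mathbf{u}^{t+1} - \mathbf{u}^*\|^2_{\mathbf{G}} \mid \mathcal{F}^t\right] + c\left(p^t - \mathbb{E}\left[p^{t+1} \mid \mathcal{F}^t\right]\right). \tag{93}$$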
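The equivalence is a direct rearrangement. The block below spells it out under the assumption, suggested by the definitions used in Appendix E, that the linear convergence claim bounds the conditional expectation of $\|\mathbf{u}^{t+1} - \mathbf{u}^*\|^2_{\mathbf{G}} + c\,p^{t+1}$ by $(1-\delta)$ times its value at step $t$.

```latex
\begin{align*}
& \mathbb{E}\left[\|\mathbf{u}^{t+1}-\mathbf{u}^*\|_{\mathbf{G}}^2 + c\,p^{t+1} \mid \mathcal{F}^t\right]
  \le (1-\delta)\left(\|\mathbf{u}^{t}-\mathbf{u}^*\|_{\mathbf{G}}^2 + c\,p^{t}\right) \\
\Longleftrightarrow\quad
& \delta\|\mathbf{u}^{t}-\mathbf{u}^*\|_{\mathbf{G}}^2 + \delta c\,p^{t}
  \le \|\mathbf{u}^{t}-\mathbf{u}^*\|_{\mathbf{G}}^2
      - \mathbb{E}\left[\|\mathbf{u}^{t+1}-\mathbf{u}^*\|_{\mathbf{G}}^2 \mid \mathcal{F}^t\right]
      + c\left(p^{t} - \mathbb{E}\left[p^{t+1} \mid \mathcal{F}^t\right]\right).
\end{align*}
```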


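Substituting the terms $\mathbb{E}\left[\|\mathbf{u}^{t+1} - \mathbf{u}^*\|^2_{\mathbf{G}} \mid \mathcal{F}^t\right]$ and $\mathbb{E}\left[p^{t+1} \mid \mathcal{F}^t\right]$ by their upper bounds as introduced in the corresponding lemmas (the latter being the claim in (37)) yields a sufficient condition for the claim in (93) as

$$\begin{aligned}
\delta\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + \delta c\,p^t \le{}& \mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^t\|^2_{\tilde{\mathbf{Z}} - 2\alpha\eta\mathbf{I}} \mid \mathcal{F}^t\right] + \mathbb{E}\left[\|\mathbf{v}^{t+1} - \mathbf{v}^t\|^2 \mid \mathcal{F}^t\right] + 2\,\mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^*\|^2_{\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}}} \mid \mathcal{F}^t\right] \\
&+ \left[\frac{c}{q_{\max}} - \frac{4\alpha L}{\eta}\right] p^t + \left[\frac{4\alpha\mu}{L} - \frac{2\alpha(2L-\mu)}{\eta} - \frac{c}{q_{\min}}\right]\left[f(\mathbf{x}^t) - f(\mathbf{x}^*) - \nabla f(\mathbf{x}^*)^T(\mathbf{x}^t - \mathbf{x}^*)\right].
\end{aligned} \tag{94}$$

We emphasize that if inequality (94) holds then the inequalities in (93) and (40) are valid. Note that $\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}}$ in the left hand side of (94) can be simplified as $\|\mathbf{x}^t - \mathbf{x}^*\|^2_{\tilde{\mathbf{Z}}} + \|\mathbf{v}^t - \mathbf{v}^*\|^2$. Considering the definition of $\Gamma$ as the maximum eigenvalue of the matrix $\tilde{\mathbf{Z}}$, we can conclude that $\|\mathbf{x}^t - \mathbf{x}^*\|^2_{\tilde{\mathbf{Z}}}$ is bounded above by $\Gamma\|\mathbf{x}^t - \mathbf{x}^*\|^2$. Considering this relation and observing the upper bound for $\|\mathbf{v}^t - \mathbf{v}^*\|^2$ in (87), we obtain that $\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} = \|\mathbf{x}^t - \mathbf{x}^*\|^2_{\tilde{\mathbf{Z}}} + \|\mathbf{v}^t - \mathbf{v}^*\|^2$ is bounded above as

$$\begin{aligned}
\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} \le{}& \frac{8}{\gamma'}\,\mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^*\|^2_{(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})^2} \mid \mathcal{F}^t\right] + \frac{8}{\gamma'}\,\mathbb{E}\left[\|\mathbf{x}^t - \mathbf{x}^{t+1}\|^2_{\tilde{\mathbf{Z}}^2} \mid \mathcal{F}^t\right] + \frac{2\Gamma'}{\gamma'}\,\mathbb{E}\left[\|\mathbf{v}^t - \mathbf{v}^{t+1}\|^2 \mid \mathcal{F}^t\right] \\
&+ \Gamma\|\mathbf{x}^t - \mathbf{x}^*\|^2 + \frac{16\alpha^2 L}{\gamma'}\, p^t + \frac{8\alpha^2(2L-\mu)}{\gamma'}\left[f(\mathbf{x}^t) - f(\mathbf{x}^*) - \nabla f(\mathbf{x}^*)^T(\mathbf{x}^t - \mathbf{x}^*)\right].
\end{aligned} \tag{95}$$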


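Further, by strong convexity of $f$ the squared norm $\|\mathbf{x}^t - \mathbf{x}^*\|^2$ is bounded above by $(2/\mu)\left(f(\mathbf{x}^t) - f(\mathbf{x}^*) - \nabla f(\mathbf{x}^*)^T(\mathbf{x}^t - \mathbf{x}^*)\right)$. Substituting this upper bound into (95) we obtain

$$\begin{aligned}
\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} \le{}& \frac{8}{\gamma'}\,\mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^*\|^2_{(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})^2} \mid \mathcal{F}^t\right] + \frac{8}{\gamma'}\,\mathbb{E}\left[\|\mathbf{x}^t - \mathbf{x}^{t+1}\|^2_{\tilde{\mathbf{Z}}^2} \mid \mathcal{F}^t\right] + \frac{2\Gamma'}{\gamma'}\,\mathbb{E}\left[\|\mathbf{v}^t - \mathbf{v}^{t+1}\|^2 \mid \mathcal{F}^t\right] \\
&+ \frac{16\alpha^2 L}{\gamma'}\, p^t + \left[\frac{8\alpha^2(2L-\mu)}{\gamma'} + \frac{2\Gamma}{\mu}\right]\left[f(\mathbf{x}^t) - f(\mathbf{x}^*) - \nabla f(\mathbf{x}^*)^T(\mathbf{x}^t - \mathbf{x}^*)\right].
\end{aligned} \tag{96}$$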


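Replacing $\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}}$ in (94) by the upper bound in (96) and regrouping the terms lead to

$$\begin{aligned}
0 \le{}& \mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^t\|^2_{\tilde{\mathbf{Z}} - 2\alpha\eta\mathbf{I} - \frac{8\delta}{\gamma'}\tilde{\mathbf{Z}}^2} \mid \mathcal{F}^t\right] + \mathbb{E}\left[\|\mathbf{x}^{t+1} - \mathbf{x}^*\|^2_{(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})^{1/2}\left[2\mathbf{I} - \frac{8\delta}{\gamma'}(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})\right](\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})^{1/2}} \mid \mathcal{F}^t\right] \\
&+ \left[1 - \frac{2\delta\Gamma'}{\gamma'}\right]\mathbb{E}\left[\|\mathbf{v}^{t+1} - \mathbf{v}^t\|^2 \mid \mathcal{F}^t\right] + \left[\frac{c}{q_{\max}} - \delta c - \frac{4\alpha L}{\eta} - \frac{16\delta\alpha^2 L}{\gamma'}\right] p^t \\
&+ \left[\frac{4\alpha\mu}{L} - \frac{2\alpha(2L-\mu)}{\eta} - \frac{c}{q_{\min}} - \frac{8\delta\alpha^2(2L-\mu)}{\gamma'} - \frac{2\delta\Gamma}{\mu}\right]\left(f(\mathbf{x}^t) - f(\mathbf{x}^*) - \nabla f(\mathbf{x}^*)^T(\mathbf{x}^t - \mathbf{x}^*)\right).
\end{aligned} \tag{97}$$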


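Notice that if the inequality in (97) holds true, then the relation in (94) is valid and, as mentioned before, the claim in (40) holds. To verify that the sum in the right hand side of (97) is always non-negative, so that the inequality is valid, we enforce each summand in the right hand side of (97) to be non-negative. Therefore, the following conditions should be satisfied

$$\begin{aligned}
&\gamma - 2\alpha\eta - \frac{8\delta\Gamma^2}{\gamma'} \ge 0, \qquad 2 - \frac{8\delta}{\gamma'}\lambda_{\max}(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}}) \ge 0, \qquad 1 - \frac{2\delta\Gamma'}{\gamma'} \ge 0, \\
&\frac{c}{q_{\max}} - \delta c - \frac{4\alpha L}{\eta} - \frac{16\delta\alpha^2 L}{\gamma'} \ge 0, \qquad \frac{4\alpha\mu}{L} - \frac{2\alpha(2L-\mu)}{\eta} - \frac{c}{q_{\min}} - \frac{8\delta\alpha^2(2L-\mu)}{\gamma'} - \frac{2\delta\Gamma}{\mu} \ge 0.
\end{aligned} \tag{98}$$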

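Recall that $\gamma$ is the smallest eigenvalue of the positive definite matrix $\tilde{\mathbf{Z}}$. All the inequalities in (98) are satisfied if $\delta$ is chosen as

$$\delta = \min\left\{ \frac{(\gamma - 2\alpha\eta)\gamma'}{8\Gamma^2},\ \frac{\gamma'}{4\lambda_{\max}(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}})},\ \frac{\gamma'}{2\Gamma'},\ \frac{\gamma'(c\eta - 4\alpha L q_{\max})}{\eta q_{\max}(c\gamma' + 16\alpha^2 L)},\ \left[\frac{4\alpha\mu}{L} - \frac{2\alpha(2L-\mu)}{\eta} - \frac{c}{q_{\min}}\right]\left[\frac{8\alpha^2(2L-\mu)}{\gamma'} + \frac{2\Gamma}{\mu}\right]^{-1} \right\}, \tag{99}$$

where $\eta$, $c$ and $\alpha$ are selected from the intervals

$$\eta \in \left(\frac{L^2 q_{\max}}{\mu q_{\min}} + \frac{L^2}{\mu} - \frac{L}{2},\ \infty\right), \qquad \alpha \in \left(0,\ \frac{\gamma}{2\eta}\right), \qquad c \in \left(\frac{4\alpha L q_{\max}}{\eta},\ \frac{4\alpha\mu q_{\min}}{L} - \frac{2\alpha q_{\min}(2L-\mu)}{\eta}\right). \tag{100}$$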
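The parameter selection in (99) and (100) can be sanity-checked numerically. The sketch below picks one admissible triple $(\eta, \alpha, c)$ from the intervals in (100), evaluates $\delta$ from (99), and returns the left hand sides of the five conditions in (98). The function names and the numerical values of $L$, $\mu$, $q_{\min}$, $q_{\max}$ and the eigenvalue bounds are hypothetical, chosen only for illustration, and the code assumes $\lambda_{\max}(\mathbf{I}+\mathbf{Z}-2\tilde{\mathbf{Z}}) > 0$.

```python
def feasible_constants(L, mu, q_min, q_max, gamma, Gamma, gamma_p, Gamma_p, lam_max):
    """Pick eta, alpha, c inside the intervals of (100) and compute delta as in (99)."""
    eta_low = L**2 * q_max / (mu * q_min) + L**2 / mu - L / 2
    eta = 2.0 * eta_low                  # any value above eta_low is admissible
    alpha = gamma / (4.0 * eta)          # any value in (0, gamma / (2 * eta))
    c_low = 4 * alpha * L * q_max / eta
    c_high = 4 * alpha * mu * q_min / L - 2 * alpha * q_min * (2 * L - mu) / eta
    c = 0.5 * (c_low + c_high)           # midpoint of the admissible interval
    delta = min(
        (gamma - 2 * alpha * eta) * gamma_p / (8 * Gamma**2),
        gamma_p / (4 * lam_max),
        gamma_p / (2 * Gamma_p),
        gamma_p * (c * eta - 4 * alpha * L * q_max)
        / (eta * q_max * (c * gamma_p + 16 * alpha**2 * L)),
        (4 * alpha * mu / L - 2 * alpha * (2 * L - mu) / eta - c / q_min)
        / (8 * alpha**2 * (2 * L - mu) / gamma_p + 2 * Gamma / mu),
    )
    return eta, alpha, c, delta

def conditions_98(L, mu, q_min, q_max, gamma, Gamma, gamma_p, Gamma_p, lam_max,
                  eta, alpha, c, delta):
    """Left hand sides of the five conditions in (98); each should be non-negative."""
    return [
        gamma - 2 * alpha * eta - 8 * delta * Gamma**2 / gamma_p,
        2 - 8 * delta * lam_max / gamma_p,
        1 - 2 * delta * Gamma_p / gamma_p,
        c / q_max - delta * c - 4 * alpha * L / eta - 16 * delta * alpha**2 * L / gamma_p,
        4 * alpha * mu / L - 2 * alpha * (2 * L - mu) / eta - c / q_min
        - 8 * delta * alpha**2 * (2 * L - mu) / gamma_p - 2 * delta * Gamma / mu,
    ]

# Hypothetical problem constants (not taken from the paper).
consts = dict(L=2.0, mu=1.0, q_min=5, q_max=10, gamma=0.5, Gamma=1.0,
              gamma_p=0.1, Gamma_p=0.5, lam_max=0.3)
eta, alpha, c, delta = feasible_constants(**consts)
lhs_values = conditions_98(**consts, eta=eta, alpha=alpha, c=c, delta=delta)
```

Because $\delta$ is the minimum of the five candidates, at least one condition in (98) is tight up to rounding, which is why a small negative tolerance is appropriate when checking the returned values.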


Notice that, considering the conditions for the variables $\eta$, $\alpha$ and $c$ in (100), the constant $\delta$ in (99) is strictly positive, $\delta > 0$. Moreover, according to the definition in (99) the constant $\delta$ is not larger than $\gamma'/2\Gamma'$, which leads to the conclusion that $\delta \le 1/2 < 1$. Therefore, we obtain that $0 < \delta < 1$ and the claim in (40) is valid.


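Appendix E. Proof of Theorem 8

The proof uses the relationship in the statement (40) of Theorem 6 to build a supermartingale sequence. To do this define the stochastic processes $\zeta^t$ and $\beta^t$ as

$$\zeta^t := \|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + c\,p^t, \qquad \beta^t := \delta\left(\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + c\,p^t\right). \tag{101}$$

Note that the stochastic processes $\zeta^t$ and $\beta^t$ are always non-negative. Let now $\mathcal{F}^t$ be a sigma-algebra measuring $\zeta^t$, $\beta^t$, and $\mathbf{u}^t$. Considering the definitions of $\zeta^t$ and $\beta^t$ and the relation in (40) we can write

$$\mathbb{E}\left[\zeta^{t+1} \mid \mathcal{F}^t\right] \le \zeta^t - \beta^t. \tag{102}$$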


Since the sequences $\zeta^t$ and $\beta^t$ are non-negative, it follows from (102) that they satisfy the conditions of the supermartingale convergence theorem; see e.g. Theorem E7.4 of Solo and Kong (1995). Therefore, we obtain that: (i) The sequence $\zeta^t$ converges almost surely. (ii) The sum $\sum_{t=0}^{\infty} \beta^t < \infty$ is almost surely finite. The definition of $\beta^t$ in (101) implies that

$$\sum_{t=0}^{\infty} \delta\left(\|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + c\,p^t\right) < \infty, \quad \text{a.s.} \tag{103}$$
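Conclusion (ii) can also be seen directly, without invoking the general theorem, under the additional assumption that $\mathbb{E}[\zeta^0]$ is finite (for instance, a deterministic initialization). Taking total expectations in the supermartingale inequality and telescoping gives the sketch below.

```latex
\begin{align*}
\mathbb{E}[\zeta^{t+1}] &\le \mathbb{E}[\zeta^{t}] - \mathbb{E}[\beta^{t}]
  && \text{(total expectation of the one-step inequality)} \\
\sum_{t=0}^{T} \mathbb{E}[\beta^{t}] &\le \mathbb{E}[\zeta^{0}] - \mathbb{E}[\zeta^{T+1}]
  \le \mathbb{E}[\zeta^{0}]
  && \text{(telescoping, using } \zeta^{T+1} \ge 0\text{)} \\
\mathbb{E}\Big[\sum_{t=0}^{\infty} \beta^{t}\Big] &\le \mathbb{E}[\zeta^{0}] < \infty
  && \text{(monotone convergence, since } \beta^{t} \ge 0\text{)}
\end{align*}
```

A non-negative random variable with finite expectation is finite almost surely, so $\sum_{t=0}^{\infty}\beta^t < \infty$ with probability 1.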


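Since $\|\mathbf{x}^t - \mathbf{x}^*\|^2_{\tilde{\mathbf{Z}}} \le \|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + c\,p^t$ and the eigenvalues of $\tilde{\mathbf{Z}}$ are lower bounded by $\gamma$, we can write $\gamma\|\mathbf{x}^t - \mathbf{x}^*\|^2 \le \|\mathbf{u}^t - \mathbf{u}^*\|^2_{\mathbf{G}} + c\,p^t$. This inequality, in association with the fact that the sum in (103) is finite, leads to

$$\sum_{t=0}^{\infty} \delta\gamma\|\mathbf{x}^t - \mathbf{x}^*\|^2 < \infty, \quad \text{a.s.} \tag{104}$$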

Observing the fact that $\delta$ and $\gamma$ are positive constants, we can conclude from (104) that the sequence $\|\mathbf{x}^t - \mathbf{x}^*\|^2$ is almost surely summable and that it converges with probability 1 to null at least in the order of $o(1/t)$. Almost sure convergence of the sequence $\|\mathbf{x}^t - \mathbf{x}^*\|^2$ to null yields the claim in (43).